




Journal of Experimental Education 








Volume V        MARCH, 1937        Number 3








STATISTICAL ANALYSIS OF PERSONALITY RATING*


PAN-LIN CHI
34 Hsing Hua Sih Street 
Peiping, China 


PART I 
CONSISTENCY 


INTRODUCTION 


Purpose of Study. Rating is one of the 
most widely used methods of studying charac- 
ter and personality. Since its procedure has 
not been well standardized, it has been used 
under various conditions. Raters have varied 
from casual novices to trained experts. Traits 
employed have varied from those which may 
be readily observed to those which are nearly 
abstract. Subjects rated have varied from 
infants to adults. And scales have varied 
from a graphic description of several definite 
categories to ranking in terms of relative positions
in a group. Results obtained under one
set of conditions are usually different from 
those obtained under another set of conditions. 
Thus the value of this method is debatable. 

The purpose of this study is to ascertain 
the consistency and significance of the person- 
ality ratings made by teachers of their pupils 
in the Elementary School of the University of 
Chicago. The conditions under which the 
ratings were made were in some respects more 
favorable than those under which the ratings 
of previous studies were made: (1) the 
teachers were generally better trained in rat- 
ing methods; (2) they had closer observation 
and better acquaintance with the pupils; (3) 
the pupils were in the fifth and sixth grades, 
and they probably express their feelings and 
attitudes with less restraint than do adults; 
and (4) the traits for rating were defined in 
terms of actions which are perhaps more ac- 
cessible for observation than are the abstract 
names of traits. The analysis of these ratings 
may be expected to throw some light upon the 
real value of rating methods in general. 


*This article is a part of a dissertation submitted to the
Department of Education of the University of Chicago in
candidacy for the degree of Doctor of Philosophy.
[The rest of this acknowledgment is illegible in the source.]


This paper consists of three parts: the first 
is concerned with the consistency of the rat- 
ings; the second is a study of the halo effect; 
and the last part is a factor analysis. 

Sources of Data. This study is based upon
the ratings of 100 pupils in two classes in the 
Elementary School. The pupils of each class 
were rated on personality traits by their teach- 
ers in the fifth grade and also in the sixth 
grade. These records are available in the 
Records Office of the Department of Educa- 
tion of the University. 

The first class of pupils entered the fifth 
grade in the fall of 1932 and the sixth grade 
in the autumn of 1933. The second class was 
in the fifth grade during the school year 1933- 
1934 and in the sixth grade during the fol- 
lowing year. Upon the elimination of those 
pupils who did not stay long enough to have 
all the rating records for the two years, com- 
plete data for one hundred pupils remained. 

Although the one hundred pupils did not 
pass each of the two grades at the same time, 
their data are comparable, as they were rated 
by the same persons in each of the two grades. 

Each of the pupils was rated, during the en- 
tire period, by five teachers, two of whom 
taught in the fifth grade, while the other three 
taught in the sixth grade. One of the fifth 
grade teachers taught social studies, and the 
other English and arithmetic. Of the sixth 
grade teachers, one taught French, the second 
social studies, and the third English and arithmetic.
Each teacher made two ratings a year,
one each semester. There were thus ten
ratings for each pupil in the two years.

The personality scale used consisted of 
twenty-six traits, each of which was defined 
in terms of several actions (ranging from one 
to eight). The pupils were rated with respect 
to the actions. Each action was analyzed on 
a graphic scale of five categories, namely, 
“never,” “rarely,” “occasionally,” “frequently,”
and “consistently.” In the present
investigation, the five categories have been
designated by the five numerals “5,” “4,” “3,”
“2,” and “1,” with “5” for “never” and “1”
for “consistently.” Thus the ratings have
been transformed from qualitative into quantitative
statements for the convenience of statistical
treatment.


Only nineteen of the twenty-six traits have 
been used in this study. They are attentive- 
ness, mental alertness, intellectual curiosity, 
originality, industry, persistence, accuracy, 
carefulness, neatness, promptness, initiative, 
independence, leadership, self-assertion, facing 
reality, co-operation, sociability, emotional 
balance, and muscular coordination. The 
other traits were discarded, because the rat- 
ings concerning them either were not discrim- 
inative or were incomplete. 


The rating on each of the nineteen traits 
was obtained by averaging the estimates re- 
garding the actions in terms of which it was 
defined. The composite rating of one pupil 
was obtained by averaging his ratings on the 
nineteen traits. The rating of one pupil for 
one year is the average of his two ratings for 
the two semesters of the year. In this way 
the data for each pupil were recorded on the 
nineteen traits and for their composite, for 
each semester and also for each of the two 
years. 
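The scoring scheme described above can be sketched in Python. This is only an illustration, not the author's procedure as recorded: the numeric coding follows the text (5 for "never" down to 1 for "consistently", with the three middle categories assumed to run in order), and the function names and the sample pupil are hypothetical.

```python
from statistics import mean

# Coding assumed from the text: "never" = 5 down to "consistently" = 1.
CATEGORY_SCORE = {
    "never": 5,
    "rarely": 4,
    "occasionally": 3,
    "frequently": 2,
    "consistently": 1,
}


def trait_rating(action_estimates):
    """Average the coded estimates on the actions that define one trait."""
    return mean([CATEGORY_SCORE[label] for label in action_estimates])


def composite_rating(trait_ratings):
    """Average a pupil's ratings over all traits (nineteen in the study)."""
    return mean(list(trait_ratings.values()))


def yearly_rating(first_semester, second_semester):
    """Average the two semester ratings to obtain the rating for the year."""
    return mean([first_semester, second_semester])


# Hypothetical pupil rated on two traits only, to keep the example short.
semester_1 = {
    "attentiveness": trait_rating(["frequently", "consistently"]),
    "neatness": trait_rating(["occasionally"]),
}
semester_2 = {
    "attentiveness": trait_rating(["frequently", "frequently"]),
    "neatness": trait_rating(["rarely"]),
}
year_composite = yearly_rating(
    composite_rating(semester_1), composite_rating(semester_2)
)
print(year_composite)
```

A teacher who skipped some of the actions under a trait would simply supply a shorter list to `trait_rating`, so that two teachers' ratings on the same trait might rest on different actions.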







METHODS OF INVESTIGATION


The consistency of the ratings of the pres- 
ent study has been determined in two ways: 
first, in terms of the correlation between the 
rating and re-rating on each of the nineteen 
traits and their composite by each of the five 
teachers; secondly, in terms of the correlations 
between every two of the five teachers. The 
former indicates how consistently each teacher 
judges; while the latter determines to what 
extent one teacher agrees with another in 
judging the same traits. 

Lastly, the correlation between the rating 
by each of the five teachers and the consensus 
of the other four has been computed. This 
correlation is sometimes used as a validity co- 
efficient for evaluating ratings. 

The correlation coefficients throughout this 
study have been calculated by the product- 
moment method, and for the convenience of 
computation Holzinger’s formula 33 has been 
used.*
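For readers who wish to reproduce such coefficients, the product-moment correlation can be computed from raw-score sums as sketched below. Holzinger's formula 33 is not reproduced in this article, so the raw-score form used here is an assumption about its general shape rather than a transcription of it; the function name and the sample ratings are hypothetical.

```python
import math


def product_moment_r(x, y):
    """Pearson product-moment correlation from raw-score sums.

    r = (N*Sxy - Sx*Sy) / sqrt((N*Sxx - Sx^2) * (N*Syy - Sy^2))
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length series of at least two ratings")
    sx, sy = sum(x), sum(y)
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    sxy = sum(a * b for a, b in zip(x, y))
    denominator = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    if denominator == 0:
        raise ValueError("correlation is undefined when a series has no variance")
    return (n * sxy - sx * sy) / denominator


# Hypothetical first-semester and second-semester ratings of six pupils
# by one teacher on one trait.
rating = [1.5, 2.0, 3.0, 3.5, 4.0, 2.5]
re_rating = [1.0, 2.5, 3.0, 3.0, 4.5, 2.0]
print(round(product_moment_r(rating, re_rating), 3))
```

Applied to a teacher's rating and re-rating of the same pupils, the result is the kind of reliability coefficient reported in Table I; applied to two different teachers' ratings, it is the kind of agreement coefficient discussed later.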


RELIABILITY BETWEEN RATING 
AND RE-RATING


The reliability coefficient between rating 
and re-rating by each of the five teachers with 
respect to each of the nineteen traits and their 
composite is presented in Table I. 


* Karl J. Holzinger, Statistical Methods for Students in Education,
p. 146. New York: Ginn and Co., 1[remainder of year illegible].


TABLE I

RELIABILITY COEFFICIENT OF EACH OF FIVE TEACHERS WITH RESPECT TO EACH OF NINETEEN
TRAITS AND THEIR COMPOSITE

                                       Teacher
Trait                        A       B       C       D       E

Attentiveness              .750    .695    .892    .893    .850
Mental alertness           .858    .653    .904    .877    .795
Intellectual curiosity     .851    .789    .923     *      .678
Originality                .775    .816    .926    .836     *
Industry                   .647    .609    .813    .867    .752
Persistence                .769    .617    .843    .868    .825
Accuracy                   .800    .822    .694    .890    .831
Carefulness                .469    .739    .860    .846    .884
Neatness                   .795    .834    .928    .855    .839
Promptness                 .594    .711    .914    .732    .861
Initiative                 .848    .673    .883    .902    .749
Independence               .421    .741    .858    .869    .679
Leadership                 .805    .844    .955     *      .835
Self-assertion             .880    .821    .853    .911    .829
Facing reality             .817    .585    .879    .879    .778
Co-operation               .708    .766    .872     *      .854
Sociability                .937    .740    .933    .816    .859
Emotional balance          .836    .665    .933    .922    .823
Muscular coordination      .879    .768    .876    .911    .846
Composite                  .924    .809    .924    .938    .888

Average                    .770    .735    .883    .871    .813

*The data are not complete.







Among the five teachers, C shows the highest 
reliability, .883, while B shows the lowest re- 
liability, .735, on the average. Nevertheless, 
the range of variation among the five averages, 
namely, .148, is rather small. The five teach- 
ers may therefore be regarded as equally 
reliable. 


The average of the average reliability co- 
efficients of the five teachers is .812. It is 
much higher than .650, the average of the re- 
liability coefficients reported in three previ- 
ous studies of similar nature.* This seems to 
indicate that the favorable conditions under 
which the ratings were made have increased 
the reliability. 


The coefficients of each teacher varied 
greatly with the traits; the highest was over 
twice as large as the lowest. In view of the 
fact that there was no common tendency for 
the teachers to show a high reliability on cer- 
tain traits and a low reliability on others, this 
variation does not appear to be due solely to 
a difference in accessibility to observation of 
the various traits, which was suggested as the 
important factor in some of the previous 
studies. Another factor suggests itself as one 
cause of this variation, namely, a difference 
in attitude or interest of a teacher toward the 
various traits. That is, a teacher would show 
a high reliability on that trait in which she 
had a keen interest and consequently made a 
real effort in its observation and estimation. 
On the other hand, she would show a low re- 
liability on another trait toward which she was 
indifferent and therefore tended to make no 
more than a perfunctory rating. 


Furthermore, the teachers differed markedly 
in the reliability of judgment as regards one 
trait. For instance, as shown in Table I, 
teacher A showed a reliability of .421 for in- 
dependence, while D showed .869 for the same 
trait; for emotional balance, B showed .665,
while C showed .933; for intellectual curiosity, 
E showed .678, while C showed .923; and so 
on. This variation also may be accounted for 
by a difference in attitude of the teachers 
toward a trait. Theoretically, when one 
teacher obtains a high reliability on one trait, 
it should be possible for another teacher to 
obtain the same reliability. 

*W. Hardin Hughes, “Relation of Intelligence to Trait
Characteristics,” Journal of Educational Psychology, XVII
(1926), 482-94.

A. W. Kornhauser, “A Comparison of Raters,” Journal of
Personnel Research, V (1927), 338-44.

H. S. Conrad, “The Validity of Personality Ratings of Preschool
Children,” Journal of Educational Psychology, XXIII
(1932), 671-80.




Another fact which may support the fore- 
going conclusion is that the traits vary only 
slightly in the reliability of rating, as meas- 
ured by the average of the five reliability co- 
efficients of the five teachers. As shown in 
Table II, the average coefficients of reliability 
of the traits range from .714 for independence 
to .877 for the composite, with an average of 
.812. The range of variation, .146, is rather
small. This seems to indicate that the traits, 
after being specifically defined, can be esti- 
mated with equal reliability, so far as rating 
and re-rating by one teacher is concerned. 

When the average of the five ratings for the 
first semester is correlated with the average of 
the five ratings for the second semester on 
each of the nineteen traits and the composite, 
the coefficients, as shown in Table III, range 
from .856 for originality to .946 for accuracy, 
with an average of .909. This average is
higher than the average reliability of each of 
the five teachers, which ranged, as mentioned 
before, from .735 to .883. This seems to confirm
a finding, reported by Kornhauser,* that
the average ratings are more reliable than the 
single ratings. 

TABLE II

AVERAGE RELIABILITY COEFFICIENT OF FIVE TEACHERS WITH RESPECT TO EACH OF
NINETEEN TRAITS AND THEIR COMPOSITE

Trait                        r

Composite                  .877
Leadership                 .860
Muscular coordination      .860
Self-assertion             .859
Sociability                .857
Neatness                   .850
Originality                .838
Emotional balance          .836
Mental alertness           .817
Attentiveness              .816
Initiative                 .811
Intellectual curiosity     .810
Accuracy                   .807
Co-operation               .800
Facing reality             .788
Persistence                .784
Promptness                 .762
Carefulness                .760
Industry                   .743
Independence               .714

Average                    .812


The coefficient as high as .909, which was
found with the average ratings, is quite significant.
It indicates that there is slight
change, after one semester, in the relative positions
of the individuals in the group. In
other words, one pupil’s rating for the second
semester can be predicted fairly well from his
rating for the first semester. Therefore it
may be concluded that the average of the ratings
by the five raters is highly reliable and
can be safely used for school purposes.

*A. W. Kornhauser, “Reliability of Average Ratings,” Journal
of Personnel Research, V (1926), 309-17.


TABLE III

RELIABILITY COEFFICIENTS BETWEEN THE AVERAGE OF FIVE RATINGS AND THE AVERAGE OF
FIVE RE-RATINGS WITH RESPECT TO NINETEEN TRAITS AND THEIR COMPOSITE

Trait                        r

Accuracy                   .946
[illegible]                .945
[illegible]                .935
[illegible]                .935
[illegible]                .934
[illegible]                .933
[illegible]                .933
[illegible]                .931
[illegible]                .919
[illegible]                .914
[illegible]                .906
[illegible]                .902
Muscular coordination      .899
[illegible]                .893
Intellectual curiosity     .886
[illegible]                .886
[illegible]                .886
Emotional balance          .874
[illegible]                .860
Originality                .856

Average                    .909


CORRELATION BETWEEN RATINGS
BY TEACHERS


In regard to the correlation between every
pair of the five teachers, ten coefficients have
been calculated for each of the nineteen traits
and for the composite. The average of the
ten coefficients for each trait and the composite
is shown in Table IV. The average coefficients
range from .292 for independence to
.676 for the composite, with an average of
.468. The coefficient for the composite is
higher than that for any of the nineteen individual
traits. That is, the raters agree better
in the summation of a number of judgments
than in a single judgment.
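The ten coefficients per trait arise because five teachers form ten distinct pairs. A minimal sketch of the averaging is given below; the teacher labels follow the article, but the ratings are hypothetical and the helper names are illustrative only.

```python
from itertools import combinations
import math


def pearson(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_pairwise_r(ratings_by_teacher):
    """Average the correlations over every pair of teachers for one trait.

    Returns the average coefficient and the number of pairs used.
    """
    pairs = list(combinations(sorted(ratings_by_teacher), 2))
    coefficients = [
        pearson(ratings_by_teacher[a], ratings_by_teacher[b]) for a, b in pairs
    ]
    return sum(coefficients) / len(coefficients), len(pairs)


# Hypothetical ratings of six pupils on one trait by teachers A to E.
ratings = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0, 3.0],
    "B": [2.0, 2.0, 3.5, 3.0, 4.5, 2.5],
    "C": [1.5, 3.0, 2.5, 4.0, 4.0, 3.5],
    "D": [1.0, 2.5, 3.0, 3.5, 5.0, 2.0],
    "E": [2.5, 1.5, 3.0, 4.5, 4.0, 3.0],
}
average_r, n_pairs = mean_pairwise_r(ratings)
print(n_pairs, round(average_r, 3))
```

Repeating the computation once per trait, and once for the composite, would give a column of twenty average coefficients of the kind shown in Table IV.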

TABLE IV

AVERAGE CORRELATION COEFFICIENTS BETWEEN EVERY PAIR OF FIVE TEACHERS FOR
NINETEEN TRAITS AND THEIR COMPOSITE

Trait                        r

Composite                  .676
Accuracy                   .627
Neatness                   .590
Attentiveness              .566
Initiative                 .553
Carefulness                .530
Mental alertness           .529
Self-assertion             .527
Industry                   .526
Leadership                 .450
Promptness                 .449
Co-operation               .433
Originality                .420
Persistence                .401
Muscular coordination      .401
Emotional balance          .395
Intellectual curiosity     .365
Sociability                .322
Facing reality             .321
Independence               .292

Average                    .468


The coefficients vary greatly with the various
traits. The highest coefficient, namely,
.627, is more than twice as large as the lowest
one, namely, .292. The traits ranking high,
with coefficients of .500 and above, are accuracy,
neatness, attentiveness, initiative, carefulness,
mental alertness, self-assertion, and
industry. The traits ranking low, with coefficients
below .400, are emotional balance,
intellectual curiosity, sociability, facing reality,
and independence. The other traits are
in the intermediate ranks. This rank order
seems to indicate that the teachers’ estimations
correlate higher on the traits which can
be more readily observed. It is to be noted
that the specific definitions given to the traits
failed to make the teachers’ estimations correlate
equally well with respect to the various
traits. One of the reasons was found to be
that the teachers rated not infrequently on
different actions listed for one trait. The evidence
will be given in the next section.

The average correlation between the teachers,
.468, is much lower than the average reliability,
.812. And, quite contrary to expectation,
the average correlation obtained for these
ratings is not so high as the average of the
correlations (.550) reported in ten previous
studies.* In view of these facts, there must
be some factors which lowered the correlation
between the teachers in the present study.
Two possible factors suggest themselves: (1)
The lack of completeness of the sub-ratings
given by the teachers on the actions listed for
many of the traits may have lowered the correlation
between their estimations for the
traits. (2) The different class-situations in
which the teachers observed their pupils may
have affected the correlation between their estimations.
These two factors will next be
examined.

* James B. Miner, “The Evaluation of a Method for Finely
Graduated Estimates of Abilities,” Journal of Applied Psychology,
I (1917), 123-33.

A. W. Kornhauser, “Reliability of Average Ratings,” Journal
of Personnel Research, V (1926), 309-17.

G. U. Cleeton and B. Knight, “Validity of Character
Judgments Based on External Criteria,” Journal of Applied
Psychology, VIII (1924), 215-31.

Paul H. Furfey, “An Improved Rating Scale Technique,”
Journal of Educational Psychology, XVII (1926), 45-48.

H. H. Remmers, N. W. Shock, and E. L. Kelly, “An Empirical
Study of the Validity of the Spearman-Brown
Formula Applied to the Purdue Rating Scale,” ibid., XVIII
(1927), 187-95.

Edward Webb, Character and Intelligence, p. 18. British
Journal of Psychology Monographs, No. 3 (1915).

H. S. Hayes and D. C. Peterson, “Experimental Development
of the Graphic Method,” Psychology Bulletin, XVIII
(1921), 161-71.

[initial illegible] Slawson, “The Reliability of Judgment of Personal
Traits,” Journal of Applied Psychology, VI (1922), 161-71.

Richard S. Uhrbrock, “[illegible] Tendencies of [illegible] Selected
Judges,” Journal of Educational Psychology, XXIII
(1932), 594-603.

Conrad, op. cit.

Effect of Lack of Completeness of Sub-rat- 
ings. As mentioned before, each of the nine- 
teen traits (except accuracy, carefulness, and 
persistence) was defined in terms of several 
actions; and the rating for a trait was the 
average of the estimates on the actions in- 
cluded. Unfortunately, the teachers fre- 
quently omitted, for some reason, to rate the 
whole group on one or more actions included 
in a trait. In some cases, one teacher skipped
one action, while another skipped another ac- 
tion under the same trait. On account of 
these omissions the rating by one teacher re- 
garding one trait often included sub-ratings 
which were different from those included in 
the rating by another teacher regarding the 
same trait. Therefore this incompleteness of 
the sub-ratings would lower the correlation be- 
tween the teachers to some extent. In order 
to investigate this effect, one single action
under each trait was selected, and the sub-ratings
by the teachers on each of these actions
were correlated. A comparison of the
correlation for each of the single actions with
the corresponding correlation for the trait as
a whole will show the effect of the incompleteness
of the sub-ratings.

The standard for choosing the actions under 
each trait was the completeness of the ratings 
by all the five teachers. The actions chosen 
are given below: 
Trait Under Which the
Action Was Listed            Action

Attentiveness                Attends to directions
Mental alertness             Is wide awake and interested
Intellectual curiosity       Asks questions
Originality                  In composition
Industry                     Works on assigned task without wasting time
Neatness                     Neat in work
Promptness                   Is on time and ready to work
Initiative                   Does more than a minimum amount of work
Independence                 Corrects his own work as far as he can
Leadership                   Is a leader
Self-assertion               States his opinion with conviction
Facing reality               Accepts just criticism willingly
Co-operation                 Enters into group activities voluntarily
Sociability                  Is quick to make friendly approaches
Emotional balance            Appears calm
Muscular coordination        In writing


The average correlation coefficient between 
every pair of the five teachers for each of 
these actions is presented in Table V. The 
coefficients are, on the average, .376. This 
average is lower than that for the heading traits,
which is .446.





TABLE V

CORRELATION COEFFICIENTS BETWEEN TEACHERS FOR SIXTEEN SINGLE ACTIONS COMPARED WITH
CORRELATION COEFFICIENTS FOR THEIR HEADING TRAITS

                             r for            r for
Trait                        Single Action    Trait     Difference

Attentiveness                  .229            .566       -.337
Mental alertness               .376            .529       -.153
Intellectual curiosity         .516            .365        .151
Originality                    .520            .420        .100
Industry                       .315            .526       -.211
Neatness                       .372            .590       -.218
Promptness                     .315            .449       -.134
Initiative                     .507            .553       -.064
Independence                   .163            .292       -.129
Leadership                     .483            .450        .033
Self-assertion                 .552            .527        .025
Facing reality                 .138            .321       -.183
Co-operation                   .326            .433       -.017
Sociability                    .436            .320        .116
Emotional balance              .454            .395        .059
Muscular coordination          .408            .401        .007

Average                        .376            .446       -.064










The comparison of the coefficient for each 
of the single actions with the coefficient for 
its heading trait under which the action was 
included is also given in Table V. Seven of 
the sixteen coefficients show an increase of 
from .007 for the action of muscular coordina- 
tion to .151 for the action of intellectual curi- 
osity. The intermediate increases are for the 
action of originality, that of leadership, of 
self-assertion, of sociability, and of emotional 
balance. Nine coefficients show a decrease of 
from .025 for the action of self-assertion to 
.337 for the action of attentiveness. The in- 
termediate decreases are for the action of 
mental alertness, that of industry, of neatness, 
of promptness, of initiative, of independence, 
of facing reality, and of cooperation. 


The increases of correlation between the 
teachers regarding the single actions are sig- 
nificant, although they are small; while the 
decreases are to be expected. As discussed 
before, different raters correlate higher when 
a composite of several ratings is used than 
they do for a single rating. According to 
this theory, each single action should yield a 
lower correlation between teachers than its 
heading trait, since the latter is a composite of 
several single actions. On the contrary, the
coefficients for the single actions are larger,
in almost half of the cases, than the coeffi- 
cients for their heading traits; and the aver- 
age of the former coefficients, namely, .376, is 
close to that of the latter coefficients, namely, 
.446. This result may be taken as evidence 
for concluding that the incompleteness of the 
sub-ratings under some of the traits lowered 
the correlation between the teachers to a cer- 
tain extent. 

The neglect of estimating on some of the 
actions listed under a trait seems to indicate 
that raters tend to differ in what to observe or 
judge upon in a trait rating. 

Effects on Judgment of Different Class-Situations.
According to Hanna,* different class-situations
will cause teachers to judge differently
on personality traits. In order to ascertain
this effect on the ratings of the present
study, the teachers were matched in three 
groups according to certain aspects of re- 
semblance; and the correlation coefficients be- 
tween the ratings by the teachers within each 
group are to be compared. 

* Joseph V. Hanna, “Variable Factors Encountered in the
Rating of Students,” School Science and Mathematics, XXV
(1925), 481-88.


TABLE VI

CORRELATION COEFFICIENTS BETWEEN TEACHERS MATCHED ACCORDING TO CERTAIN ASPECTS OF
RESEMBLANCE, FOR NINETEEN TRAITS AND THEIR COMPOSITE

                                                                 Teachers of
                            Teachers of      Teachers of         Different Subjects
                            Same Grade       Same Subject        in Different Grades
Trait                        r      Rank      r      Rank         r      Rank

Attentiveness              .561      6      .516      8         .596      3
Mental alertness           .567      5      .471     12         .521      7
Intellectual curiosity     .338     19      .395     14         .390     16
Originality                .403     14      .518      6.5       .396     14
Industry                   .534      8      .575      4         .497      8
Persistence                .433     12      .311     18         .413     13
Accuracy               [illegible]   4      .683      1         .655      2
Carefulness                .499      9      .494     10         .579      6
Neatness                   .590      2      .605      3         .583      4
Promptness                 .485     10      .329     17         .432     11
Initiative                 .556      7      .493     11         .582      5
Independence               .320     20      .347     15         .238     20
Leadership                 .421     13      .518      6.5       .459     12
Self-assertion             .573      3      .506      9         .484      9
Facing reality             .371     18      .236     19         .391     15
Co-operation               .377     17      .448     13         .481     10
Sociability                .389     16      .172     20         .324     19
Emotional balance          .442     11      .345     16         .375     17
Muscular coordination      .392     15      .528      5         .349     18
Composite                  .700      1      .651      2         .663      1

Average                    .474             .458                .472


The first is a grade group which was composed
of one pair of raters teaching fifth
grade and three pairs of raters teaching sixth
grade. The second is a subject group which
was made up with one pair of raters who
taught English and arithmetic and another
pair who taught social studies. The last or
a heterogeneous group included four pairs of
raters, each pair teaching two different subjects
in two different grades. Thus there resulted
four coefficients from the first group,
two from the second group, and four from the
third group.
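The split of the ten teacher pairs into groups of four, two, and four follows directly from the teaching assignments given under Sources of Data, as the short sketch below confirms. The labels T1 to T5 are hypothetical, since the article does not say which of the teachers A to E taught which subject.

```python
from collections import Counter
from itertools import combinations

# Grade and subject of each teacher, as listed under "Sources of Data".
# The labels T1..T5 are placeholders, not the article's letters A..E.
TEACHERS = {
    "T1": (5, "social studies"),
    "T2": (5, "English and arithmetic"),
    "T3": (6, "French"),
    "T4": (6, "social studies"),
    "T5": (6, "English and arithmetic"),
}


def classify_pair(first, second):
    """Assign a pair of teachers to the grade, subject, or heterogeneous group."""
    grade_1, subject_1 = TEACHERS[first]
    grade_2, subject_2 = TEACHERS[second]
    if grade_1 == grade_2:
        return "grade"
    if subject_1 == subject_2:
        return "subject"
    return "heterogeneous"


groups = Counter(classify_pair(a, b) for a, b in combinations(TEACHERS, 2))
print(dict(groups))
```

No pair shares both grade and subject, so the three groups do not overlap and together account for all ten coefficients.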


If the subject which a teacher taught influenced
her attitude most in judging the personality
traits of the pupils, the correlation between
the teachers of the subject group would
be higher than that between the teachers of
either of the other two groups. If the grade
in which a teacher taught influenced her attitude
most in judging the personality traits of
the pupils, the correlation between the teachers
of the grade group would be higher than
that between the teachers of either of the
other two groups. Lastly, if both the subject
which a teacher taught and the grade in which
she taught affected markedly her attitude in
rating the pupils, the correlation obtained for
the subject group and that for the grade group
would both be higher than the correlation obtained
for the heterogeneous group, composed
of teachers who taught different subjects in
different grades.

The average correlation coefficient for each 
group with respect to each trait is presented in 
Table VI. The three average coefficients, 
namely, .474, .458, and .472, are almost equal. 
Moreover, the rank order of the traits accord- 
ing to their coefficients as shown in Table VI 
varies only slightly with the different groups. 
Thus it is obvious that the subjects which the 
teachers taught and the grade in which they 
taught did not particularly influence their ob- 
servation and estimation of the personality 
traits of the pupils. 

This result is not in harmony with Hanna’s 
finding. The explanation is probably that 
the school subjects the teachers taught, 
namely, English and arithmetic, social studies, 
and French, are rather similar regarding ac- 
tivities of students, so that they have no dif- 
ferent effects on the teachers’ observation of 
the pupils’ personality traits. 

When the teachers of different subjects and 
of different grades rate in a similar way, their 
estimates on a trait will indicate the same per- 
sonality quality. The ratings by the different 
teachers can then be combined to obtain a 
higher reliability for practical purposes than 
that of one teacher. 





TABLE VII

“INTERNAL” VALIDITY COEFFICIENTS OF THE RATINGS BY EACH OF FIVE TEACHERS FOR SIXTEEN
TRAITS* AND THE COMPOSITE

                                       Teacher
Trait                        A       B       C       D            E

Attentiveness              .580    .723    .655    .783         .712
Mental alertness           .651    .647    .652    .661         .690
Industry                   .679    .567    .572    .817         .784
Persistence                .436    .426    .558  [illegible]    .682
Accuracy                   .742    .703    .702    .858         .691
Carefulness                .593    .498    .800    .699         .730
Neatness                   .713    .662    .676    .786       [illegible]
Promptness                 .580    .433    .633    .684         .618
Initiative                 .663    .531    .744    .715         .750
Independence               .414    .523    .430    .587         .248
Self-assertion             .659    .675    .716    .495         .731
Facing reality             .258    .438    .621    .606         .559
Co-operation               .430    .667    .532    .631         .607
Sociability                .306    .630    .603    .565         .242
Emotional balance          .494    .481    .567    .483         .650
Muscular coordination      .592    .437    .453    .596         .648
Composite                  .746    .744    .766    .824         .804

Average                    .561    .576    .628    .644         .639

*Intellectual curiosity, originality, and leadership do not appear in the table, because the data on
them are incomplete.










MAJORITY JUDGMENT AS THE CRITERION 
FOR VALIDATION 


The correlation between the judgment of 
each of the five teachers and the consensus of 
the other four may be regarded as the coeffi- 
cient of “internal” validity for each teacher. 
Since this criterion is based on internal con- 
sistency, the resulting coefficient does not tell 
whether the estimations are valid with refer- 
ence to an outside criterion. 
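The computation behind such an “internal” validity coefficient can be sketched as follows: each teacher's ratings are correlated with the average of the other four teachers' ratings of the same pupils. The helper names and the sample ratings are hypothetical, and the unweighted mean is assumed as the “consensus.”

```python
import math


def pearson(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def internal_validity(ratings_by_teacher):
    """Correlate each teacher's ratings with the mean rating of all the others."""
    coefficients = {}
    for teacher, own in ratings_by_teacher.items():
        others = [r for name, r in ratings_by_teacher.items() if name != teacher]
        # Pupil-by-pupil consensus of the remaining teachers (assumed: plain mean).
        consensus = [sum(values) / len(values) for values in zip(*others)]
        coefficients[teacher] = pearson(own, consensus)
    return coefficients


# Hypothetical ratings of six pupils on one trait by teachers A to E.
ratings = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0, 3.0],
    "B": [2.0, 2.0, 3.5, 3.0, 4.5, 2.5],
    "C": [1.5, 3.0, 2.5, 4.0, 4.0, 3.5],
    "D": [1.0, 2.5, 3.0, 3.5, 5.0, 2.0],
    "E": [2.5, 1.5, 3.0, 4.5, 4.0, 3.0],
}
for teacher, r in internal_validity(ratings).items():
    print(teacher, round(r, 3))
```

Because the criterion is built from the other raters, a high coefficient shows only that a teacher agrees with her colleagues, which is the limitation the text notes.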

The coefficients of “internal” validity are 
presented in Table VII. Three traits, 
namely, intellectual curiosity, originality, and 
leadership, have been omitted from this sec- 
tion, because the data on them were not quite 
complete for the purposes of this part of the 
investigation. The average coefficients of “internal”
validity range from .561 for teacher A
to .644 for teacher D. This range of varia- 
tion among them, namely, .083, is quite small 
and insignificant. Thus the ratings by the 
five raters are equally valid as judged from 
the internal consistency criterion. The aver- 
age validity of the five raters, .610, indicates 
that, in general, one teacher’s judgment is 
rather close to the combined judgments of the 
other four. 


With the average validity .610 and the av- 
erage reliability .812 (Table II) for one 
teacher, the average of the judgments of the 
five teachers will yield a validity coefficient of 
.739 according to Holzinger’s formula 51.⁷
This figure is quite high. Theoretically only 
the average of ratings by an infinite number 
of raters can show perfect validity; but it is 
far from a practical possibility. The coeffi- 
cient of .739 may be interpreted to mean that 
the average of the ratings by the five teachers 
is fairly close to the true rating. 


SUMMARY 


1. The reliability coefficients for the five 
teachers, as measured by the average of the 
correlations between ratings and re-ratings 
with respect to the nineteen traits and their 
composite, varied from .735 for teacher B to 
.883 for teacher C. This range of variation 
being rather small, the five teachers may be 
regarded as equally reliable in judgment. 

The average of the average reliability
coefficients of the five teachers was .812. It is
much higher than .650, the average of the
reliability coefficients reported in previous
studies. The favorable conditions under which
the ratings were made appear to have raised
the reliability.

⁷ Op. cit., p. 170.


2. When the average of the five ratings for 
the first semester was correlated with that of 
the five re-ratings for the second semester, the 
reliability coefficients ranged from .856 for 
originality to .946 for accuracy, with an average
of .909. This coefficient is quite high. It
indicates that the average rating can be safely 
used for educational guidance. 


3. The average correlation coefficients be- 
tween the ratings by every pair of the five 
teachers on the various traits and their composite
ranged from .292 for independence to
.676 for the composite, with an average of .468.
The wide range of variation of the correlations
for various traits indicates that the teachers 
agreed in different degrees in judging the vari- 
ous traits. The higher correlations appear 
to come with the traits which are more acces- 
sible to observation. 


The disparity between the average corre- 
lation, .468, and the average reliability, .812, 
is quite large. One of the factors causing the 
correlation between the ratings of different 
teachers to be too low was found to be the in- 
completeness of the sub-ratings on many of 
the traits. 


The different class-situations in which the 
teachers observed the pupils affected the 
teachers’ estimations little, if any. 

4. The “internal” validity of the rating by 
a single teacher, on the average, was .610, 
which is fair. The average ratings by five 
teachers would yield a higher coefficient, 
namely, .739. This coefficient may be inter- 
preted to mean that the average rating of each 
pupil is reasonably close to his true rating. 

5. On the whole, the estimations of the five 
teachers were quite reliable; but the correla- 
tions between them were not so high as might 
be expected. The disparity of the two phases 
of consistency seems to be due partially to 
the fact that raters differ in what they observe. 


PART II 
HALO EFFECT 


MEANING OF HALO EFFECT 


By halo effect is meant a systematic error 
of judgment which is usually found in ratings. 
This error is due to the fact that the rater’s 
general impression of a person rated affects
his ratings with respect to specific traits; in
other words, a rater can hardly judge a per- 
son with respect to specific traits independ- 
ently of his general impression of the person 
being rated. The name of this error was first
given in 1920 by Thorndike.⁸ Nevertheless,
the error was discovered by Wells as early as
1907 in his statistical study of literary merit.⁹
In that study he had a graduate English class 
rate ten authors on ten literary qualities. 
Commenting on the halo effect, he says: 
There is noted introspectively a tendency 
to grade for general merit at the same time 
as for the qualities, and to allow an indi- 
vidual’s general position to influence his po- 
sition in the qualities. This would be the 
case especially in the case of those qualities 
that were ill-defined in the minds of the 
subjects and tended to be interpreted rather 
in terms of general merit. . . . This
would make the correspondence of such 
qualities appear closer than they are. 


Webb pointed out the same error in his 
study of intelligence rating. He says, “The 
observers in estimating intelligence qualities 
are biased in the direction of marking subjects 
who possess other desirable qualities too 
highly and vice versa.”¹⁰ Thorndike reported,
from a study by Knight, the correlations
between general merit as a teacher and
forty-five traits.¹¹ Most of them were between
.50 and .70. He concluded that these
correlations were too high and too even, and 
that the ratings on the specific traits of a 
teacher must have been influenced by the 
rater’s general impression of her. The same 
error was also found by Rugg in his study of 
the Army Rating Scale.¹² He says, “We
judge our fellows in terms of a general attitude
toward them; and there is dominating
this mental attitude toward particular qualities.”
Thus, it seems quite certain that halo
is a reality in ratings, and that it is a trouble- 
some factor and tends to make ratings invalid. 

⁸ E. L. Thorndike, “A Constant Error in Psychological Rating,” Journal of Applied Psychology, IV (1920), 25-29.
⁹ F. L. Wells, A Statistical Study of Literary Merit, Archives of Psychology, No. 7 (1907).
¹⁰ Op. cit., 70.
¹¹ Op. cit.
¹² H. O. Rugg, “Is the Rating of Human Character Practicable?” Journal of Educational Psychology, XIII (1922).
¹³ P. M. Symonds, “Notes on Rating,” Journal of Applied Psychology, IX (1925).

So far, however, no study investigating the
size of the halo has been published except
the short paper of Symonds in
1925.¹³ In that study, he had a group of
pupils rated by two judges with respect to
seven traits, which were honesty, obedience, 
courtesy, orderliness, cleanliness, sportsman- 
ship, and promptness. He took the average 
of the ratings of each pupil with respect to 
the seven traits as the general impression. 
After he had determined the correlation co- 
efficients between the two raters for each 
of the seven traits and the coefficient for 
the general impression, he used the tech- 
nique of partial correlation to eliminate the 
general impression from each of the seven 
correlations for the individual traits. Then 
for each trait he took the difference be- 
tween the raw and the partial correlations as 
the size of the halo effect in the rating on that 
trait. As a result, six of the seven correla- 
tions were found to have been raised by the 
halo by amounts ranging from .086 to .423, 
while one was lowered by .083. The halo ef- 
fect in the seven correlations was, on the av- 
erage, .245. 
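
Symonds' procedure, as summarized above, rests on the standard first-order partial correlation. The following Python sketch shows that computation; it is an illustration only. The function names and the numerical values are hypothetical, and treating the general impression as a single third variable `z` is an assumption, since the summary does not say how the two judges' general impressions were combined.

```python
from math import sqrt

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y with z held constant."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def symonds_halo(r_xy, r_xz, r_yz):
    """Raw correlation minus partial correlation, the quantity Symonds
    took as the size of the halo for one trait."""
    return r_xy - partial_corr(r_xy, r_xz, r_yz)

# Hypothetical values: the two judges correlate .60 on a trait, and each
# judge's rating on that trait correlates .50 with the general impression.
raw = 0.60
partial = partial_corr(raw, 0.50, 0.50)
halo_size = symonds_halo(raw, 0.50, 0.50)
```

With these made-up inputs the partial correlation is lower than the raw one, so the difference is positive, which corresponds to the six traits whose correlations Symonds found to be raised.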

Although Symonds’ study is suggestive, his 
method is not justified for two reasons. In 
the first place, the general impression which 
a rater had of each of the pupils might not be 
the average of his seven traits; it might quite 
possibly be something outside of them. Sec- 
ondly, his assumption that halo tends to raise 
the correlations between the ratings by two 
raters on individual traits is not in harmony 
with its definition. According to the forego- 
ing discussion, halo tends only to raise the 
intercorrelations of the ratings on different 
traits by one rater. Consequently, what he 
found cannot be properly regarded as the halo 
effect. 


METHOD OF INVESTIGATION 


According to the foregoing discussion, halo 
is the effect upon ratings of a general attitude 
of a rater toward a subject. This general at- 
titude affects his attitudes concerning the va- 
rious traits. Because of this effect the corre- 
lation between two traits, according to the rat- 
ings by one rater, tends to be higher than it 
should be. On the other hand, since two rat- 
ers are not likely to take the same attitude or 
to be under the same prejudice toward an in- 
dividual rated, the correlation between two 
traits, according to the ratings by two differ- 
ent raters, would be relatively free from the 
halo effect. Hence the difference between the 
former and the latter correlation coefficients 
may be regarded as the halo effect contained 
in the ratings by one rater. 



















Theoretically, the correlation between two 
traits, according to the ratings by one rater, 
would be unity when the halo reaches its max- 
imum; as the larger the halo, the less the dif- 
ferentiation among the different traits of an in- 
dividual. And it would be zero when there 
is no halo effect, provided that the traits are 
really independent of each other and that they 
do not overlap in the rater’s conception of 
them. 


In order not to make this investigation too 
laborious to handle, three raters only have 
been included. They are A, B, and C, the 
three teachers of the sixth grade. 

The intercorrelations of the ratings on the
nineteen traits by each of the three teachers,
A, B, and C, are indicated by $r_{AA'}$, $r_{BB'}$, and
$r_{CC'}$, respectively. All these intercorrelations
are supposed to contain the halo effect. The
intercorrelations of the ratings by every possible
pair of the three teachers are indicated
by $r_{AB}$, $r_{BA}$, $r_{AC}$, $r_{CA}$, $r_{BC}$, and $r_{CB}$. All of
these contain no halo. Each coefficient $r_{AB}$ is
defined as the correlation between the ratings
of two traits, the first of which was rated by
teacher A, the second by teacher B. Each
coefficient $r_{BA}$ is the correlation between the
ratings of two traits, the first of which was rated
by teacher B, the second by teacher A. In
like manner $r_{AC}$ is distinguished from $r_{CA}$,
and $r_{BC}$ from $r_{CB}$. The averages of the sets
of six intercorrelations are defined as the
“true” intercorrelations, in the sense that they
are free from halo effect, and that, as resulting
from the judgments of three teachers, they
contain few errors of individual judgment.

The method of determining the halo may be 
formulated as follows: 


Let $r_T$ = the “true” intercorrelations among the nineteen traits,
    $r_H$ = the average of $r_{AA'}$, $r_{BB'}$, and $r_{CC'}$,
    $H$ = the size of the halo contained in $r_H$, or the average halo;

then
\[ r_T = \frac{r_{AB} + r_{BA} + r_{AC} + r_{CA} + r_{BC} + r_{CB}}{6} \tag{1} \]
\[ r_H = \frac{r_{AA'} + r_{BB'} + r_{CC'}}{3} \tag{2} \]
\[ H = r_H - r_T \tag{3} \]

and
\[ r_{AA'} - r_T = \text{the halo contained in } r_{AA'} \tag{4} \]
\[ r_{BB'} - r_T = \text{the halo contained in } r_{BB'} \tag{5} \]
\[ r_{CC'} - r_T = \text{the halo contained in } r_{CC'} \tag{6} \]
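
Formulas (1) to (3) can be illustrated for a single pair of traits with a short Python sketch. Everything named here is an assumption made for illustration: the data layout (`ratings[teacher][trait]` as a list of pupil scores), the function names, and the toy numbers, which are not the study's data. In the toy data each teacher gives identical ratings on both traits, so the within-teacher correlation is 1 and the whole excess over the between-teacher correlation shows up as halo.

```python
from itertools import permutations
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def r_within(ratings, i, j):
    """Formula (2): mean over teachers of corr(trait i, trait j),
    both traits rated by the same teacher."""
    vals = [pearson(r[i], r[j]) for r in ratings.values()]
    return sum(vals) / len(vals)

def r_between(ratings, i, j):
    """Formula (1): mean over ordered teacher pairs (s, t), s != t, of
    corr(trait i rated by s, trait j rated by t)."""
    vals = [pearson(ratings[s][i], ratings[t][j])
            for s, t in permutations(ratings, 2)]
    return sum(vals) / len(vals)

def halo(ratings, i, j):
    """Formula (3): halo = within-teacher minus between-teacher correlation."""
    return r_within(ratings, i, j) - r_between(ratings, i, j)

# Hypothetical ratings of five pupils by three teachers on two traits.
toy = {
    "A": {"neatness": [1, 2, 3, 4, 5], "industry": [1, 2, 3, 4, 5]},
    "B": {"neatness": [2, 1, 3, 5, 4], "industry": [2, 1, 3, 5, 4]},
    "C": {"neatness": [1, 3, 2, 4, 5], "industry": [1, 3, 2, 4, 5]},
}
r_h = r_within(toy, "neatness", "industry")
r_t = r_between(toy, "neatness", "industry")
h = halo(toy, "neatness", "industry")
```

With three teachers, `permutations` yields the six ordered pairs AB, AC, BA, BC, CA, CB that enter formula (1). Repeating the calculation over all 171 pairs of nineteen traits would give one row of Table VIII per pair.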


DETERMINATION OF HALO 


The intercorrelations of the nineteen ratings
by each of the three teachers, or $r_{AA'}$,
$r_{BB'}$, and $r_{CC'}$, are not presented here because
of limited space. Since the three sets of
intercorrelations are all assumed to contain halo,
the highest among the three would be from the
rater whose estimations were most affected by
halo, assuming that the intercorrelations
would otherwise be equal. The average of
the coefficients $r_{CC'}$ is .591, the average of $r_{BB'}$
is .380, and the average of $r_{AA'}$ is .380.
Among the three teachers, apparently, C’s
estimations contain the largest halo; while
A’s and B’s are roughly on a par in this
respect. The difference between the average
$r_{CC'}$ and $r_{AA'}$ is .202; and the difference
between the average $r_{CC'}$ and $r_{BB'}$ is .211. These
differences appear quite significant, although
the means of comparison is rather crude.

The comparison of two correlation coeffi- 
cients with their probable errors does not ap- 
ply here, because each of the three averages is 
obtained from a group of 171 coefficients. 
The probable error of an average of a number 
of coefficients should be smaller than that of a 
single coefficient. Unfortunately the formula 
for its computation is not available. 

It is rather safe to say that the ratings by 
C contain a larger amount of halo than those 
by either A or B; the two latter teachers dif- 
fer only slightly in this respect. 

It should be noted that the variation among 
the teachers in the average size of the inter- 
correlations could be affected by their varia- 
tion either in reliability or in validity. In 
other words, the reliable and valid rater may 
tend to obtain a higher correlation in estimat- 
ing two traits than does an unreliable and in- 
valid rater. This effect, however, does not oc- 
cur in the present case, for the three teachers 
of the present study are similar, as discussed 
in Part I of this paper, both in reliability and 
in validity. 

In view of the fact that the intercorrela- 
tions for one rater may contain errors of judg- 
ment other than halo, it is more justifiable 
for the determination of the halo to use $r_H$,
the average of the three sets of intercorrela- 
tions by the three teachers separately, since 
the average intercorrelations may be consid- 
ered as near a typical set of intercorrelations 
for one teacher and they contain few errors of 
individual judgment besides halo. The re- 
sult of this comparison will show the amount 
of halo which usually operates in rating. 

The actual amount of the halo contained
in $r_H$ cannot be obtained until the true
intercorrelations for the various traits are
available, since the determination of the halo
involves both of the two sets of intercorrelations.
The true intercorrelations should be
the average of an infinite number of sets of
intercorrelations for the nineteen traits, each
pair of which should be estimated by every
pair of an infinite number of raters. This
is, of course, an actual impossibility. In the
present study, only an approximation to the
true intercorrelations has been obtained, that
is, $r_T$ by formula (1).


According to formula (3), the differences,
$r_H$ minus $r_T$, for each pair of the traits, give
the amount of the halo. This comparison is
presented in Table VIII.


The resulting differences for the various
combinations of the nineteen traits are practically
all positive. The differences range
from -.064 to .385, with an average of .140;
and over two thirds of them are larger than
.100. The consistent appearance of positive
differences may be taken as evidence for the
existence of halo effect in the ratings.

It seems reasonable that the ratings on the
various traits were affected by the halo in
different degrees; as the ratings on the more
specific traits naturally would be less affected
and the ratings on the more general
traits would be more affected by the halo.


TABLE VIII

DIFFERENCES BETWEEN AVERAGES OF INTERCORRELATIONS FOR THREE TEACHERS ($r_H$),
AND “TRUE” INTERCORRELATIONS ($r_T$)

Pair of Traits                $r_H$         $r_T$         $r_H - r_T$

Mental alertness v.
  industry                    [illegible]   .325          .128
  carefulness                 .483          .358          .125
  initiative                  .732          .494          .238
  neatness                    .288          .244          .044
  sociability                 .479          .262          .217
  emotional balance           .588          .349          .239
  muscular coordination       .270          .206          .064
  attentiveness               .489          .271          .218
  intellectual curiosity      .638          .441          .197
  persistence                 .616          .355          .261
  originality                 .453          .354          .099
  accuracy                    .678          .473          .205
  promptness                  .607          .374          .233
  independence                .566          .388          .178
  leadership                  .582          .456          .126
  self-assertion              .530          .280          [illegible]
  cooperation                 .350          .243          .107
  facing reality              .588          .363          .225
Industry v.
  carefulness                 .607          .402          .205
  initiative                  .646          .460          .186
  neatness                    .447          .352          .095
  sociability                 .168          .067          .101
  emotional balance           .515          .313          .202
  muscular coordination       .287          .271          .016
  attentiveness               .413          .423          [illegible]
  intellectual curiosity      .341          .277          .064
  persistence                 .759          .374          .385
  originality                 .440          .331          .109
  accuracy                    .707          .420          .287
  promptness                  .672          .396          .276
  independence                .600          .352          .248
  leadership                  .284          .226          .058
  self-assertion              [illegible]   .222          .104
  cooperation                 .586          .314          .272
  facing reality              .599          .371          .228
Carefulness v.
  initiative                  .480          .425          .055
  neatness                    .682          .431          .251
  sociability                 [illegible]   .164          .001
  emotional balance           .499          .298          .201
  muscular coordination       .455          .329          .126
  attentiveness               .692          .339          .353
  intellectual curiosity      .334          .280          .054
  persistence                 .528          .319          .209














TABLE VIII (Continued)

Pair of Traits                $r_H$         $r_T$         $r_H - r_T$

Carefulness v. (continued)
  originality                 .369          .310          [illegible]
  accuracy                    .653          .383          [illegible]
  promptness                  .526          .346          [illegible]
  independence                .610          .335          [illegible]
  leadership                  .321          .237          [illegible]
  self-assertion              .331          .210          [illegible]
  cooperation                 .376          .275          [illegible]
  facing reality              .500          .305          [illegible]
Initiative v.
  neatness                    .395          .385          [illegible]
  sociability                 .373          .231          [illegible]
  emotional balance           .549          .400          [illegible]
  muscular coordination       .335          .291          [illegible]
  attentiveness               .565          .395          [illegible]
  intellectual curiosity      .605          .425          [illegible]
  persistence                 .676          .457          [illegible]
  originality                 .584          .394          [illegible]
  accuracy                    .729          .495          [illegible]
  promptness                  .607          .457          [illegible]
  independence                .608          .399          [illegible]
  leadership                  .506          .407          [illegible]
  self-assertion              .646          .429          [illegible]
  cooperation                 .472          .359          [illegible]
  facing reality              .666          .420          [illegible]
Neatness v.
  sociability                 .104          .057          [illegible]
  emotional balance           .286          .218          [illegible]
  muscular coordination       .549          .354          [illegible]
  attentiveness               .452          .252          [illegible]
  intellectual curiosity      .189          .172          [illegible]
  persistence                 .401          .296          [illegible]
  originality                 .278          .192          [illegible]
  accuracy                    .412          .307          [illegible]
  promptness                  .395          .298          [illegible]
  independence                .267          .260          [illegible]
  leadership                  .232          .194          [illegible]
  self-assertion              .121          .117          [illegible]
  cooperation                 .314          .225          [illegible]
  facing reality              .304          .223          [illegible]
Sociability v.
  emotional balance           .426          [illegible]   [illegible]
  muscular coordination       .144          [illegible]   [illegible]
  attentiveness               .215          [illegible]   [illegible]
  intellectual curiosity      .255          [illegible]   [illegible]
  persistence                 .272          [illegible]   [illegible]
  originality                 .328          [illegible]   [illegible]
  accuracy                    .216          [illegible]   [illegible]
  promptness                  .281          [illegible]   [illegible]
  independence                .269          [illegible]   [illegible]
  leadership                  .484          [illegible]   [illegible]
  self-assertion              .559          [illegible]   [illegible]
  cooperation                 .512          [illegible]   [illegible]
  facing reality              .409          [illegible]   [illegible]
  [Ten $r_T$ values follow the Neatness block in the scan but cannot be matched to rows:
  .337, .178, .081, .285, .182, .220, .150, .224, .234, .303.]
Emotional balance v.
  [The twelve rows from muscular coordination to facing reality are not legible.
  Six $r_H$ values survive but cannot be matched to rows: .273, .421, .556, .365, .419, .480.]


TABLE VIII (Continued)

Pair of Traits                $r_H$         $r_T$         $r_H - r_T$

Muscular coordination v.
  attentiveness               .295          [illegible]   [illegible]
  intellectual curiosity      .261          [illegible]   [illegible]
  persistence                 .202          [illegible]   [illegible]
  originality                 .367          [illegible]   [illegible]
  accuracy                    .360          [illegible]   [illegible]
  promptness                  .217          [illegible]   [illegible]
  independence                [illegible]   [illegible]   [illegible]
  leadership                  .249          [illegible]   [illegible]
  self-assertion              .183          [illegible]   [illegible]
  cooperation                 .284          [illegible]   [illegible]
  facing reality              .157          .221          [illegible]
Attentiveness v.
  intellectual curiosity      .321          .259          [illegible]
  persistence                 .574          .315          [illegible]
  originality                 .456          .382          [illegible]
  accuracy                    .623          .390          [illegible]
  promptness                  .633          .332          [illegible]
  independence                .577          .399          [illegible]
  leadership                  .296          .199          [illegible]
  self-assertion              .299          .168          [illegible]
  cooperation                 .547          .301          [illegible]
  facing reality              .523          .320          .203
Intellectual curiosity v.
  persistence                 .445          .315          .130
  originality                 .462          .336          .126
  accuracy                    .558          .346          .212
  promptness                  .371          .320          .051
  independence                .427          .306          .121
  leadership                  .465          .376          .089
  self-assertion              .665          .455          .210
  cooperation                 .178          .214          [illegible]
  facing reality              .430          .303          .127
Persistence v.
  originality                 .394          .326          .068
  accuracy                    .543          .391          .152
  promptness                  [illegible]   .365          .362
  independence                .594          .323          .271
  leadership                  .340          .265          .075
  self-assertion              .474          .305          .170
  cooperation                 .494          .272          [illegible]
  facing reality              .694          .336          .358
Originality v.
  accuracy                    .525          .379          .146
  promptness                  .305          .343          [illegible]
  independence                .383          .355          .028
  leadership                  .593          .281          .312
  self-assertion              .449          .329          .120
  cooperation                 .351          .230          .121
  facing reality              .386          .299          .087
Accuracy v.
  promptness                  .506          .362          .144
  independence                .662          .410          .252
  leadership                  .463          .339          .124
  self-assertion              .561          .364          .197
  cooperation                 .393          .198          .195
  facing reality              .597          .336          .261
Promptness v.
  independence                .570          .288          .282
  leadership                  .357          .323          .034
  self-assertion              .425          [illegible]   .119
  cooperation                 .525          .302          [illegible]
  facing reality              .622          .362          .260
Independence v.
  leadership                  .361          .273          .088
  self-assertion              .437          .308          [illegible]
  cooperation                 .450          .326          .124
  facing reality              .513          .307          .206


TABLE VIII (Continued)

Pair of Traits                $r_H$         $r_T$         $r_H - r_T$

Leadership v.
  self-assertion              .653          .480          .173
  cooperation                 .335          .232          .103
  facing reality              .400          .214          .186
Self-assertion v.
  cooperation                 .310          .212          .098
  facing reality              .495          .303          .192
Cooperation v.
  facing reality              .545          .278          .267

Average                       .456          .306          .140


The average of the correlations due to the
halo is .140, which may be regarded as the
average amount of the halo found in the ratings.
It is about half as large as the average
of the “true” intercorrelations, namely, .306.
Thus the halo effect in the ratings is quite
significant. A rating on one trait does not
have a completely specific value when it contains
halo. Its validity is thus impaired.


SUMMARY


1. According to the evidence obtained, the
existence of the halo effect in the ratings has
been proved. Although the ratings were given
with respect to specific actions in terms of
which the traits were defined, the halo was
still a significant factor operative in the
teachers’ estimations.

2. Raters probably differ to some extent in
making this error of judgment, and traits for
estimation are subject to the halo effect in
different degrees. From the present data, the
halo is .140 on the average. The validity of the
ratings is thus impaired.


PART III
FACTOR ANALYSIS


The significance of personality ratings may
be examined with the technique of factor
analysis. In the ordinary sense, a trait is
supposedly a mental entity; whence the ratings
on the various traits should be independent
of one another. In human judgment,
however, various traits can hardly be
independently rated, because irrelevant as well as
relevant factors often operate in the rater’s
mind. What are the operative factors? And
what proportion does each of them play in
accounting for the variance of the ratings?
These may be answered through factor
analysis.

This study is based on ratings by three
teachers of their 100 pupils on nineteen traits.
The source of the data was described in
Part I.

The analysis was begun with the “true”
intercorrelations, or $r_T$, for the nineteen traits.
As discussed in Part II, the “true” intercorre- 
lations are approximately free from halo effect. 

For analysis, the B-factor technique¹⁴ was
applied. According to the uniformity of the
B-coefficients, there seems to be only one general
factor running through this table of correlations.
After the removal of this general factor,
the residual correlations appear altogether
insignificant. The distribution of the residuals,
with its mean, standard deviation, and
probable error, is given below. The probable
error of zero correlation is also given for
comparison.


TABLE IX
FREQUENCY OF RESIDUAL CORRELATIONS

Residuals                     Frequency
 .1700 to  .1899                  2
 .1500 to  .1699                  2
 .1300 to  .1499                  1
 .1100 to  .1299                  4
 .0900 to  .1099                  4
 .0700 to  .0899                  5
 .0500 to  .0699                  7
 .0300 to  .0499                 15
 .0100 to  .0299                 23
-.0100 to  .0099                 33
-.0300 to -.0101                 33
-.0500 to -.0301                 17
-.0700 to -.0501                 10
-.0900 to -.0701                  5
-.1100 to -.0901             [illegible]
-.1300 to -.1101                  2
-.1500 to -.1301             [illegible]
-.1700 to -.1501                  2

Number of residuals             171
Mean                            .00003
Standard deviation              .0583
Probable error                  .0393
P. E. of zero correlation       .0674
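
The summary figures in Table IX can be checked with the usual probable-error relations: a probable error is 0.6745 times the corresponding standard error, and the standard error of a zero correlation is approximately 1 divided by the square root of N. The Python sketch below is illustrative only; the function names are hypothetical, and N = 100 is taken from the statement in Part III that the three teachers rated 100 pupils.

```python
from math import sqrt

PE_FACTOR = 0.6745  # probable error = 0.6745 x standard error (normal curve)

def probable_error(sd):
    """Probable error of a distribution with standard deviation sd."""
    return PE_FACTOR * sd

def pe_zero_correlation(n):
    """Probable error of a zero correlation based on n cases (approximate)."""
    return PE_FACTOR / sqrt(n)

pe_resid = probable_error(0.0583)    # S.D. of the residuals, Table IX
pe_zero = pe_zero_correlation(100)   # assumed N = 100 pupils
# Upper edge of the top class interval in Table IX against three times the P.E.
largest_within_limit = 0.1899 < 3 * pe_zero
```

Under these assumptions the probable error of the residuals comes out at about .0393, and three times the probable error of a zero correlation is about .202, so even the top class interval of the table lies inside that limit.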


¹⁴ Preliminary Report on Spearman-Holzinger Unitary Trait
Study, No. 7: K. J. Holzinger and F. Swineford, An Objective
Method for Allocating Tests for the Bi-Factor Pattern and
Steps for Pattern Evaluation. Prepared at the Statistical
Laboratory, Department of Education, The University of Chicago,
1936. (Lithoprinted.)
















This distribution is nearly normal, with a 
mean of about zero; and even the largest of 
the residuals is less than three times the 
probable error of zero correlation. Thus, all 
of the residual correlations may be regarded 
as due to chance. The application of the 
tetrad criterion to the comparatively large 
residuals also failed to reveal any extra fac- 
tors. The evidence seems sufficient for con- 
cluding that only one general factor runs 
through the “true” intercorrelations. 
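
The logic of removing one general factor and inspecting the residuals can be illustrated with Spearman's classical single-factor loading formula. This is a stand-in chosen for brevity, not the B-coefficient procedure cited in the text; the function names and the toy loadings are hypothetical. For trait i, with S_i the sum of its correlations with the other traits, Q_i the sum of their squares, and T the sum of all correlations over distinct pairs, the squared loading is (S_i^2 - Q_i) / (2(T - S_i)).

```python
from math import sqrt

def general_factor_loadings(R):
    """Single-general-factor loadings from a symmetric correlation
    matrix R (list of lists). Diagonal entries are ignored.
    Needs at least three traits; inconsistent data can make the
    quantity under the square root negative."""
    n = len(R)
    S = [sum(R[i][j] for j in range(n) if j != i) for i in range(n)]
    Q = [sum(R[i][j] ** 2 for j in range(n) if j != i) for i in range(n)]
    T = sum(S) / 2  # each unordered pair is counted twice in sum(S)
    return [sqrt((S[i] ** 2 - Q[i]) / (2 * (T - S[i]))) for i in range(n)]

def residuals(R, loadings):
    """Observed correlation minus the product of the two loadings,
    for every distinct pair of traits."""
    n = len(R)
    return [R[i][j] - loadings[i] * loadings[j]
            for i in range(n) for j in range(i + 1, n)]

# Toy matrix built from hypothetical loadings, so one factor fits exactly.
true_loadings = [0.7, 0.6, 0.5, 0.4]
R = [[1.0 if i == j else true_loadings[i] * true_loadings[j]
      for j in range(4)] for i in range(4)]
est = general_factor_loadings(R)
res = residuals(R, est)
```

In this constructed case the residuals vanish. With real data they scatter about zero, and their spread can be compared with the probable error of a zero correlation, as is done with Table IX.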

The nature of this general factor may be 
examined from its loadings in the various 
traits; they are presented in rank order as 
follows: 

TABLE X
GENERAL FACTOR LOADINGS

                            Loadings of
Ratings                     General Factor
Initiative                  .7493
Mental alertness            .6568
Accuracy                    .6463
Promptness                  .5979
Independence                .5924
Industry                    .5920
Carefulness                 .5797
Persistence                 .5772
Self-assertion              .5696
Originality                 .5680
Intellectual curiosity      .5623
Emotional balance           .5618
Facing reality              .5497
Leadership                  .5384
Attentiveness               .4810
Cooperation                 .4732
Muscular coordination       .4509
Neatness                    [illegible]
Sociability                 .3715


The loadings of this factor in the various 
ratings are quite high. In view of the fact 
that the traits of the present study are all de- 
sirable qualities, the general factor obtained 
from them indicates probably some good per- 
sonality quality which is fundamental in 
nature. 

If a rater has only a vague conception of 
the traits, an overlapping may result in his 
various ratings. This overlapping would 
raise the intercorrelations between the traits 
to too high a value, and thereby create a false 
general factor. This error is probably very 
small in the present case, if it occurs at all. 
The reason is that the traits were defined in 
terms of specific actions, which could hardly 
be misconceived. 

One might again suspect that the general 
factor is due to age, which is usually responsible
for positive correlations between mental
abilities or traits. This explanation does not
apply to any great extent in the present data, 
since the subjects of the present ratings were 
in the same grades and were of about the same 
age level. 

All these facts seem to indicate that the 
general factor found from these data is some 
fundamental quality, or at least that it is so 
according to the judgments of the teachers. 

Further analysis of the loadings of the gen- 
eral factor in the various ratings will clarify 
the nature of this factor. As shown in the 
foregoing table, the loadings range from .3715 
to .7493. Initiative has the highest loading. 
Next rank mental alertness, accuracy, prompt- 
ness, independence, industry, carefulness, per- 
sistence, and self-assertion. All these traits 
ranking above the median loading are some- 
what volitional in nature and are essential 
for school achievement. The two intellectual 
traits, originality and intellectual curiosity, 
rank in the middle. The emotional traits, 
emotional balance and facing reality, come be- 
low the middle. Then come leadership and 
attentiveness. Down at the bottom of the 
rank order are the social and physical traits, 
namely, cooperation, muscular coordination, 
neatness, and sociability. This order of the 
loadings seems to indicate that the general fac- 
tor is in close relation to the volition to 
achieve in school work. This volition being 
contrary to biological drives is probably a re- 
sult of training. Since most of the loadings 
in the various ratings are quite high, with a 
median of .5680, this factor is probably per- 
vasive in scope; it may run also through 
many other traits. It is a fact that the voli- 
tion to achieve in a task is usually responsible 
for developing many of the desirable traits for 
that task. 

The volition to achieve may be called a 
general factor of will-power. It was also 
found by Webb¹⁵ and advocated by Spearman¹⁶
as one general factor underlying human
personality and character. 

In view of the fact that the estimates may 
be different from measurements on perform- 
ance tests, it is an interesting question to ask 
whether the general factor would be as large 
as is found from the ratings, or if it would 
come out at all, if the traits were measured 
with performance tests. Its answer is beyond 
the scope of this study. 


¹³ E. Webb, Character and Intelligence, British Journal of
Psychology Monographs, No. 3 (1915), p. 60.
¹⁴ C. Spearman, Abilities of Man (New York: Macmillan
Co., 1927), p. 347.


















From the intercorrelations due to the halo 
effect, (ru—rr), a general factor has also 
been obtained by the same procedure. The 
residuals after the removal of this factor are 
practically all within the range of three times 
the probable error of zero correlation. As 
this set of correlations is presumably due to 
the halo effect, as discussed in Part II, the 
factor obtained therefrom is the general fac- 
tor of halo. Its loadings in the various rat- 
ings are as follows: 


TABLE XI
HALO FACTOR LOADINGS

Ratings                        Loadings of Halo
Persistence                         .7116
Facing reality                      .5309
Attentiveness                       .5263
Accuracy                            .5025
Mental alertness                    .4797
Emotional balance                   .4323
Promptness                          .4271
Initiative                          .4244
Carefulness                         .4214
Cooperation                         .4163
Independence                        .4157
Self-assertion                      .3829
Industry                            .3048
Leadership                          .3016
Neatness                            .2847
Originality                         .2729
Sociability                         .2684
Intellectual curiosity              .2646
Muscular coordination               .0768


The halo factor runs through the rating on 
every one of the traits; its loadings vary 











widely with the traits, ranging from .0768 to 
.7116. As is to be expected, the rating on 
the trait which can be most readily observed, 
namely muscular coordination, has almost zero 
loading with the halo factor. Similarly, the 
ratings on neatness, intellectual curiosity, 
originality, and sociability have also compara- 
tively low saturations of the halo. On the 
other hand, the ratings on the somewhat 
vague traits, such as persistence, facing real- 
ity, attentiveness, accuracy, mental alertness, 
emotional balance, etc. have comparatively 
high loadings of the halo. 


The combination of the loadings of the halo 
factor and the general factor will produce ap- 
proximately ry or the average intercorrela- 
tions by the three raters separately; they were 
shown in Table I in Part II. 


The uniqueness of each rating can be determined
after the coefficients of the common factors are
known. If C designates the coefficient of a factor,
H and W designate the halo factor and the general
factor of will-power respectively, and U designates
the uniqueness, the coefficient of the latter may
be obtained from the formula $C_H^2 + C_W^2 + C_U^2 = 1$.
The uniqueness of each rating contains both the
specific factor and the chance error factor. The
latter is the unreliability of each rating, and its
coefficient is $\sqrt{1 - r_{11}^2}$. The coefficient
of each specific factor can be obtained by removing
the chance factor from the uniqueness by the formula
$C_S^2 + C_E^2 = C_U^2$.¹⁵ The resulting factor pattern
is presented in Table XII.


TABLE XII
FACTOR PATTERN

                                 Factors with Coefficients      Total
    Traits                       H       W       S       E     Variance

 1  Mental alertness           .4797   .6568   .5482   .1949    1.0001
 2  Industry                   .3048   .5920   .6787   .3100    1.0001
 3  Carefulness                .4214   .5797   .6242   .3110     .9999
 4  Initiative                 .4244   .7493   .4677   .1990     .9999
 5  Neatness                   .2847   .4416   .8378   .1482     .9999
 6  Sociability                .2684   .3715   .8793   .1300    1.0000
 7  Emotional balance          .4323   .5618   .6795   .1890     .9999
 8  Muscular coordination      .0768   .4509   .8748   .1595     .9999
 9  Attentiveness              .5263   .4810   .6655   .2210    1.0000
10  Intellectual curiosity     .2646   .5623   .7698   .1460    1.0000
11  Persistence                .7116   .5772   .3074   .2570    1.0001
12  Originality                .2729   .5680   .7597   .1610    1
13  Accuracy                   .5025   .6463   .5271   .2280    1
14  Promptness                 .4271   .5979   .6265   .2600    1
15  Independence               .4157   .5924   .6077   .3265    1
16  Leadership                 .3016   .5384   .7758   .1320    1
17  Self-assertion             .3829   .5696   .7118   .1490
18  Cooperation                                                 1
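
The two formulas above can be checked against a single row of Table XII. The following is a minimal sketch, not the author's own computation: the function names are illustrative, and the only inputs are the tabled coefficients H, W, and E for mental alertness.

```python
import math


def uniqueness(c_h, c_w):
    """Coefficient of uniqueness, from C_H^2 + C_W^2 + C_U^2 = 1."""
    return math.sqrt(1.0 - c_h ** 2 - c_w ** 2)


def specific(c_h, c_w, c_e):
    """Coefficient of the specific factor, from C_S^2 + C_E^2 = C_U^2."""
    c_u = uniqueness(c_h, c_w)
    return math.sqrt(c_u ** 2 - c_e ** 2)


# Mental alertness row of Table XII: H = .4797, W = .6568, E = .1949
c_s = specific(0.4797, 0.6568, 0.1949)
print(round(c_s, 4))
```

Rounded to four places, the result agrees with the tabled specific coefficient of .5482 for this trait.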


The coefficients of the specific factors vary 
widely with the various ratings, with a range 
from .3074 to .8793. The ratings on muscu- 
lar coordination, neatness, and sociability have 
specific loadings above .8000; while those on 
persistence, initiative, mental alertness, ac- 
curacy, and facing reality have comparatively 
low specific loadings, below .6000. The for- 
mer traits do seem more accessible to obser- 
vation than the latter. So the result is quite 
sensible. 


The chance error factor varies also in the 
various ratings, ranging from .1300 to .3265. 
Sociability, neatness, intellectual originality, 
leadership, and self-assertion were rated with 
comparatively less chance error; while indus- 
try, carefulness, and independence were rated 
with more chance errors. As a whole, how- 
ever, the chance errors are all quite small. 


It is interesting to compare the loadings of 
the two general factors in each of the ratings. 
They are independent from each other. One 
rating may have a low loading of one of the 
two factors but a high loading of the other. 
For example, the rating on industry has a 
comparatively low loading of halo, while it 
has a comparatively high loading of will-power. 
The result is just what is to be expected. As 
industry is a trait which can be easily observed 
and specifically rated, it would be accordingly 
less affected by the halo. And, since indus- 
try is a result of will-power, naturally its rat- 
ing is highly saturated with the general factor. 
These two factors, although both running
through all of the ratings on the various traits,
are distinct in nature. While the halo is a
psychological error of a rater's judgment, the
will-power belongs to the personality of the
rated. The two general factors would come
out together, were it not for the statistical
device of their separation, as discussed in
Part II.


¹⁵ L. L. Thurstone, Vectors of Mind, University of Chicago
Press, Chicago, 1936, p. 68.


According to these data, there are three fac- 
tors besides chance errors in each rating. 
One of the main purposes of this study was to 
show what proportion each factor plays in ac- 
counting for the total variance. On the aver- 
age, as presented in Table XII, the halo fac- 
tor accounts for 17.13 per cent, the general 
factor accounts for 31.59 per cent, the specific 
factors account for 46.53 per cent, and the 
chance errors account for 4.75 per cent. The 
halo is an inescapable error in rating method; 
but it appears not to be so large as it was 
commonly believed to be. This is probably 
due to the specific definition of the traits in 
this study. The chance errors are almost 
negligible. The general factor is probably as 
important as the specific factors for the un- 
derstanding of a personality. The combina- 
tion of both the general factor and the specific 
factor for the rating on a trait accounts for 
78.12 per cent of the variance. Thus, the rat- 
ings are quite significant for the study of per- 
sonality. 
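
The percentages quoted above are squared coefficients expressed as shares of the total variance. A short sketch of that arithmetic follows; the dictionary keys are illustrative labels, and the figures are taken from Table XII and from the averages stated in the text.

```python
def variance_shares(coefficients):
    """Per cent of total variance for each factor: 100 times the squared coefficient."""
    return {name: 100.0 * c ** 2 for name, c in coefficients.items()}


# One row of Table XII (mental alertness) as an illustration.
row = {"halo": 0.4797, "will_power": 0.6568, "specific": 0.5482, "chance": 0.1949}
shares = variance_shares(row)

# Average shares as reported in the text (per cent).
averages = {"halo": 17.13, "will_power": 31.59, "specific": 46.53, "chance": 4.75}
total_average = sum(averages.values())
personality_share = averages["will_power"] + averages["specific"]

for name, pct in shares.items():
    print(f"{name:11s} {pct:6.2f}")
print(f"sum of averages = {total_average:.2f}, general + specific = {personality_share:.2f}")
```

The four average shares add to 100 per cent, and the general and specific factors together give the 78.12 per cent cited in the paragraph above.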

SUMMARY 


According to these data, there are four fac- 
tors operative in each rating, namely, the gen- 
eral factor of will-power, the specific factor, 
the halo, and the chance error. 

The loading of each factor in each rating 
varies with the traits. On the average, the 
specific factor accounts for a little less than 
one-half of the total variance, the general fac- 
tor of will-power accounts for about one-third, 
the halo accounts for about one-sixth, and the 
chance error accounts for about five per cent. 

Since both the systematic error of the halo 
and the chance error are rather small and the 
specificity of the ratings is high, the ratings 
seem to be quite useful for the study of per- 
sonality. 







A NOTE ON THE VALIDITY AND DIFFICULTY OF ITEMS
IN FORM A OF THE OTIS SELF-ADMINISTERING
TESTS OF MENTAL ABILITY*


ALPHONSE CHAPANIS
Connecticut State College


Although the Otis Self-Administering Test
of Mental Ability (1) is widely used, there
seem to be no published data concerning
specific item validities of the test. The purpose
of this study was to obtain measures of the
validity and difficulty of test items in Form A
of the higher examination of this test based on
a group different from that on which the
original item validities were established. As a
matter of interest, these were computed
separately for the sexes in order to note whether
there might be any significant sex differences
in the validity or difficulty of test items.

Otis (2; Pg. 3) states that the items were
originally validated on “about 1000 high
school students and 1000 grammar
school pupils . . .” The subjects in the
present study were 100 men and 100 women 
who had taken the test in connection with an- 
other investigation. The group of 100 women 
had a mean Otis I. Q. of 108 with the ex- 
tremes at 72 and 140. Ages varied from 18 
to 33. The group of men had a mean Otis 
I. Q. of 111 with a range of 72 to 135. Ages 
varied from 18 to 32 with three exceptions— 
one 36, one 37, and one 52. The mean age 
was 25. 

Concerning the original validation, Otis 
states (2; Pg. 3): 

“These students were divided in each case
into two groups, a ‘good group’ and a ‘poor
group’. The same number were taken from
each grade from both groups. The good
group constituted the young students, and
the poor group the old students. These
groups had reached the same average
educational status, therefore, but at different
rates. Now it is the rate at which a student
can progress through school that the
mental-ability test is chiefly used to predict.
Therefore this is believed to be the
best criterion by which to judge the validity
of each item that goes into the test.
The number of times each item was passed
by each group was then found and only
those items chosen which showed a distinct
gain in number of passes by the good group
over the number of passes by the poor
group in spite of the fact that the median
age of the good group was over two years
less than that of the poor group. Each
item justified its inclusion, therefore,
because it distinguished between students who
progressed slowly and those who progressed
rapidly.

“The items in each form of each examination
have been arranged in the order of
difficulty, according to the number of passes
of each item by the students taking the
preliminary editions.”


* The writer wishes to express his indebtedness to Dr. E.
Lowell Kelly for his helpful suggestions and criticisms.


In this study the number of correct re- 
sponses for each item constitutes a measure of 
the difficulty of that item. The item validity 
was determined by means of the biserial co- 
efficient of correlation computed with the aid 
of a nomograph furnished by Dr. Jack W. 
Dunlap.* The formula given on the nomo- 
graph is: 

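$r_{bis} = \dfrac{M_p - M_t}{\sigma_t} \cdot \dfrac{p}{z}$, in which

$r_{bis}$ = biserial coefficient of correlation
$p$ = percent in category
$M_p$ = mean of the category
$M_t$ = mean of the total distribution
$\sigma_t$ = standard deviation of the total distribution
$z$ = ordinate of the normal curve at the point of division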


This formula is derived from that given in 
Dunlap—Kurtz (3; Pg. 123). 
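
The computation that the nomograph shortcuts can be sketched in a few lines. This is a minimal illustration under stated assumptions: it uses the standard biserial formula, (M_p − M_t)/σ_t multiplied by p/z, with z the ordinate of the unit normal curve at the point cutting off the proportion p; the function name and the sample scores are hypothetical and are not data from this study.

```python
from statistics import NormalDist, mean, pstdev


def biserial_r(total_scores, passed_item):
    """Biserial r between passing an item and the total score.

    total_scores : total test score of each person
    passed_item  : True where the person answered the item correctly
    """
    pass_scores = [s for s, ok in zip(total_scores, passed_item) if ok]
    p = len(pass_scores) / len(total_scores)
    if not 0.0 < p < 1.0:
        # With everyone passing or everyone failing, the coefficient is undefined.
        raise ValueError("item must be passed by some, but not all, persons")
    m_p = mean(pass_scores)        # mean of the passing category
    m_t = mean(total_scores)       # mean of the total distribution
    sigma_t = pstdev(total_scores) # standard deviation of the total distribution
    normal = NormalDist()
    z = normal.pdf(normal.inv_cdf(p))  # ordinate at the point of division
    return (m_p - m_t) / sigma_t * p / z


# Hypothetical scores for ten persons and their result on one item.
scores = [12, 15, 18, 20, 22, 25, 27, 30, 33, 36]
passed = [False, False, True, False, True, True, False, True, True, True]
r_example = biserial_r(scores, passed)
print(round(r_example, 3))
```

An item passed by every person gives no coefficient at all, which is why the function refuses that case rather than returning a number.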

The table below shows the biserial coeffi- 
cients and the percentage of correct responses 
by sexes for each item of the test. It will be 
noted that the data for items 40 and 45 have 
been analyzed twice. These items on the test 
read as follows: 


(40) If 2½ yards of cloth cost 30 cents, how
many cents will 10 yards cost? ........
(45) If 4½ yards of cloth cost 90 cents, how
many cents will 2½ yards cost? ........


* Dunlap, Jack W., “Nomograph for computing bi-serial
correlations.” Psychometrika, V. 1, No. 2, pp. 59-60 (1936).




Testees often misinterpret or overlook the
last part of these two problems, which asks for
the answer in cents, and record the correct
digits but fail to express the answer
in cents. Biserial r’s for these items were
computed under two conditions: (a) the
answer called correct if the correct digits were
recorded, leaving out of consideration the
dollars and cents notation and the decimal
point; and (b) the answer called correct only
if the correct digits were recorded in cents.
Thus in item 40, 120 and 120¢ were correct
while 1.20, 1.20¢ and $1.20 were incorrect,
and in item 45, 50 and 50¢ were correct while
.50, .50¢ and $.50 were incorrect. It is
interesting to note that the validity drops much
more in item 40 when the correct notation is
required. This may be due to the fact that
we are more prone to think in terms of fifty
cents than in terms of one hundred and
twenty cents.

Regarding the difficulty of the items, a cur- 
sory examination of the table will show that 
for this group, at any rate, certain items are 
misplaced. Items 9, 34, 52, and 53, for ex- 
ample, are too easy for their respective posi- 
tions, while items 21, 37, and 44 are too dif- 
ficult for their positions. 

Validity coefficients range from .153 in item
59 to .99 in item 1, with a median value of .610
for the group as a whole. The median
coefficient values for the sexes separately coincide
at .610. The five least valid items for both
sexes are 59, 25, 20, 29 and 6. The five most
valid items are 1, 34, 72, 65 and 67.


TABLE SHOWING THE BISERIAL COEFFICIENTS OF CORRELATION AND PERCENTAGE OF CORRECT
RESPONSES FOR EACH ITEM OF FORM A, HIGHER EXAMINATION,
OTIS SELF-ADMINISTERING TEST OF MENTAL ABILITY

  Item         Biserial r        Percent Passing
 Number      Male    Female      Male    Female
   *1        .930     .990        96       97
 ***2        .766     .450        89       83
    3        .250     .877        86       86
 ***4        .760     .353        93       89
    5        .227     .455        85       88
  **6        .390     .297        86       84
    7        .502     .472        84       84
    8        .540     .419        79       86
 ***9        .955     .620        99       99
   10        .743     .510        97       95
***11        .313     .610        92       91
   12        .678     .880        98       93
   13        .500     .740        92       91
   14        .663     .645        80       82
   15        .647     .902        80       68
   16        .604     .573        93       91
   17        .581     .748        94       91
   18        .480     .438        90       86
   19        .590     .690        97       94
 **20        .284     .262        87       92
   21        .390     .337        61       68
   22        .651     .560        87       86
   23        .293     .531        92       85
   24        .410     .401        80       77
 **25        .242     .204        65       70
   26        .584     .560        94       90
   27        .784     .648        99       96
***28        .322     .800        91       88
 **29        .285     .340        92       95
   30        .696     .631        91       89
   31        .498     .768        90       86
   32        .640     .575        97       94
   33        .480     .253        82       81
  *34                 .900       100       98
   35        .686     .600        85       71
   36        .493     .524        94       88
   37        .620     .638        62       52
   38        .718     .699        74       63
   39        .620     .580        76       76
   40a       .550     .610        62       63
   40b       .118     .440        24       12
   41        .706     .830        79       76
   42        .650     .470        74       74
   43        .660     .477        59       44
   44        .701     .601        36       34
   45a       .765     .660        62       41
   45b       .608     .680        39       27
   46        .392     .533        85       77
   47        .393     .512        68       78
  #48        .520     .480        83       60
   49        .536     .440        56       53
   50        .518     .615        51       44
   51        .447     .524        59       63
   52        .401     .560        84       80
   53        .571     .473        90       85
   54        .590     .747        62       61
   55        .540     .272        60       65
  #56        .639     .650        64       45
   57        .630     .720        59       64
   58        .808     .122        44       40
 **59        .277     .153        23       17
   60        .638     .824        55       44
  #61        .710     .812        55       36
   62        .827     .610        59       48
   63        .173     .132        68       61
   64        .710     .750        54       45
  *65        .810     .850        64       60
   66        .735     .830        46       31
  *67        .816     .800        28       20
   68        .498     .470        25       24
   69        .671     .687        43       32
   70        .596     .598        20       18
   71        .680     .682        54       47
  *72        .940     .810        34       21
   73        .874     .700        28       25
   74        .655     .651        34       29
   75        .782     .800        16       12

*   indicates five most valid items
**  indicates five least valid items
*** indicates pronounced sex differences in item validity
#   indicates pronounced sex differences in item difficulty

Sex differences in difficulty are most
pronounced in items 48, 56 and 61. These
differences are all significant (critical ratios, i.e.
$\frac{\text{diff.}}{\text{S.E.}_{\text{diff.}}}$, exceed 3). Sex differences in
validity are most pronounced in items 28, 4, 9, 2,
and 11. In attempting to explain these sex
differences, the recent work of Terman and
Miles (4; Pg. 384) may be helpful. They
point out that on an information test there
are certain kinds of activities about which
females are apt to have more knowledge than
males and vice versa. In the Otis test, for
example, items 11, 12, and 13, which are more
valid for females, deal with the interpretation
of proverbs. This may be related to the
greater acquaintance which females show
concerning literature and fiction. Item 4 asks
“The opposite of honor is (?) ........
1 glory, 2 disgrace, 3 cowardice, 4 fear, 5 defeat”
and is much more valid for males.
This may be interpreted in view of the fact
that males show more knowledge concerning
“physical . . . facts . . . exploit, adventure,
invention whether in fact or fiction and
where the topic is one that appeals to the
pugnacious, aggressive or vigorously active
tendencies” (4; Pg. 384). Likewise item 48
asks “The opposite of treacherous is (?) ........
1 friendly, 2 brave, 3 wise, 4 cowardly, 5
loyal” and is significantly harder for the
females. It is the author’s suggestion that sex
differences in item validity and difficulty may
possibly bear some relation to sex differences
in knowledge regarding certain types of
activity.
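
A critical ratio of the kind mentioned above can be illustrated with the percent-passing figures for item 48 (83 per cent of 100 men, 60 per cent of 100 women). The sketch below assumes the usual standard error of a difference between two independent percentages; the source does not say which standard-error formula was used, so the result need not match the author's own figures, and the function name is illustrative.

```python
import math


def critical_ratio(pct_a, n_a, pct_b, n_b):
    """Difference between two independent percentages divided by its standard error."""
    p_a, p_b = pct_a / 100.0, pct_b / 100.0
    # Assumed formula: unpooled standard error of a difference of proportions.
    se_diff = math.sqrt(p_a * (1.0 - p_a) / n_a + p_b * (1.0 - p_b) / n_b)
    return (p_a - p_b) / se_diff


# Item 48: 83 per cent of the 100 men and 60 per cent of the 100 women passed.
cr_item_48 = critical_ratio(83, 100, 60, 100)
print(round(cr_item_48, 2))
```

Under this assumed formula the ratio for item 48 is above 3.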

In conclusion, the results of this study seem 
to indicate that, with few exceptions, the item 
validities and difficulties of the Otis Self-Ad- 
ministering Test of Mental Ability (Form A— 
Higher Examination) do not consistently fa- 
vor either sex. For a relatively unselected 
group of adults such as was used in this study 
the general validity of the items is high, al- 
though the items are of less than average dif- 
ficulty for this group. 


REFERENCES 


1. Otis, Arthur S. Otis Self-Administering
Test of Mental Ability—Higher Examination,
Form A. New York: World Book Co.;
1922.

2. Otis, Arthur S. Otis Self-Administering
Tests of Mental Ability—Manual of
Directions and Key. New York: World Book
Co.; 1928.

3. Dunlap-Kurtz. Handbook of Statistical
Nomographs, Tables, and Formulas. New
York: World Book Co.; 1932.

4. Terman, Lewis M. and Miles, Catherine
Cox. Sex and Personality. New York:
McGraw Hill Book Co., Inc.; 1936.







AN EXPERIMENT WITH MULTIPLE CHOICE VOCABULARY 
TESTS CONSTRUCTED BY TWO DIFFERENT PROCEDURES 


VICTOR H. KELLEY
Phoenix Union High Schools and Junior College
Phoenix, Arizona


In the construction of multiple choice tests 
the author is confronted with the problem of 
securing alternate responses which will prove 
to be real sources of confusion to the pupils. 
Too often, in a five response multiple choice 
test, only one or two of the incorrect responses 
are plausible answers. Pupils are able to se- 
cure the correct answer merely by eliminating 
the unreasonable incorrect responses. 

This study was undertaken to determine 
whether a multiple choice word meaning test 
in which the alternate responses were selected 
from words which pupils confused with the 
stimulus word would result in a more valid 
test than a test in which the alternate re- 
sponses were selected without any definite 
knowledge of their possible confusion to the 
child. One hundred words were systemati- 
cally selected from the Thorndike Word List 
and built into two tests. 

Test A was built without any preliminary
experimentation. It is a test built by the
so-called arm-chair method. The five choices
were secured by following these specifications:


1. A word which was given as a correct
synonym in Webster’s New International
Dictionary should be the correct answer.
Each synonym should be checked in the
Thorndike Teachers’ Word Book in an
attempt to find a word similar in meaning
with a higher frequency than the
stimulus word. This helps to assure the
testing of a word by means of a synonym
which is familiar to the child. Thorndike
and Symonds¹ show that there is a
relationship between frequency of
occurrence and difficulty.

2. The other responses should be selected
from:

a. A word which is as nearly opposite
in meaning to the stimulus word
as possible.

b. A word which has a beginning
similar to the stimulus word.

c. A word which has an ending
similar to the stimulus word.

d. A word which has no relation to
the stimulus word but which, in the
opinion of the writer, might be
confusing to the child.

3. If possible all alternates should be the
same part of speech.²


¹ Thorndike, E. L., and Symonds, Percival M., “Difficulty,
Reliability, and Grade Achievements in a Test of English
Vocabulary.” Teachers College Record XXIV (1923), pp. 438.


Test B was constructed upon the assump- 
tion that valid alternate responses could be 
built if the most common meaning children 
attach to a word were known as well as the 
wrong meanings which were associated with 
the stimulus word. Accordingly, the one hun- 
dred test words were presented to four hun- 
dred children in the fourth, fifth, and sixth 
grades with instructions to take all the time 
necessary to supply a word similar and oppo- 
site in meaning to the words which they 
thought they knew. The responses to each 
test word were tabulated together with the 
frequency of occurrence, and the five choices 
for the test were secured by selecting: 


1. The most common response as the correct
answer if it also was a common
meaning as indicated in Webster’s
Dictionary.

2. The three most common incorrect
responses.

3. The most commonly given opposite
word.
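
The selection rules for Test B amount to a frequency tabulation, which can be sketched as follows. Everything here is hypothetical illustration: the function name, the stimulus word, and the response lists are invented, and the choice to return nothing when the most common response is not an accepted meaning is an assumption, since the source does not say what was done in that case.

```python
from collections import Counter


def build_choices(similar_responses, opposite_responses, accepted_meanings):
    """Pick the five choices for one stimulus word from tabulated pupil responses.

    similar_responses  : words pupils gave as similar in meaning
    opposite_responses : words pupils gave as opposite in meaning
    accepted_meanings  : meanings the dictionary lists for the stimulus word
    """
    if not similar_responses or not opposite_responses:
        return None
    ranked = [word for word, _ in Counter(similar_responses).most_common()]
    # Rule 1: the most common response, if the dictionary accepts it.
    correct = ranked[0]
    if correct not in accepted_meanings:
        return None  # assumption: the source does not cover this case
    # Rule 2: the three most common incorrect responses.
    incorrect = [w for w in ranked if w not in accepted_meanings][:3]
    # Rule 3: the most commonly given opposite word.
    opposite = Counter(opposite_responses).most_common(1)[0][0]
    return {"correct": correct, "incorrect": incorrect, "opposite": opposite}


# Hypothetical tabulation for the stimulus word "ancient".
similar = ["old"] * 5 + ["big"] * 3 + ["antique"] * 2 + ["angel"] * 2 + ["ant"]
opposite = ["new"] * 4 + ["young"] * 2
choices = build_choices(similar, opposite, accepted_meanings={"old", "antique"})
print(choices)
```

Ties in frequency are broken by order of first appearance in the tabulation, which is simply how `Counter.most_common` behaves and not a rule stated in the source.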


In order to compare the validities of the
two tests a measure was developed which
served as a criterion. After some preliminary
experimentation, the ability of the pupil to
use a word in a correct sentence was accepted
as a criterion. The one hundred words were
broken up into two lists and presented to
the pupils in two working periods with
instructions to use each word in a sentence to
show that they knew its meaning. One hundred
fifty-five pupils in grade five were used
in this phase of the experiment. Each pupil
took both of the multiple-choice recognition
tests, half of them taking Test A first and the
other half Test B. Two weeks’ time elapsed
between the administration of the criterion
test and the first recognition test, and likewise,
two weeks’ time elapsed between the
administration of the first and second recognition
tests.


² In the construction of Test A the following references were
used: Thorndike’s Teacher's Word Book, Webster’s New
International Dictionary, and Roget’s Thesaurus of Synonyms
and Antonyms.


The accompanying table shows the average
score and validity coefficient of each of the
two tests. The reliability of Test A as
computed by the odd-even technique and the
Spearman-Brown prophecy formula was .923,
of Test B .916, and of the criterion measure
.892.
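
The odd-even technique with the Spearman-Brown step-up can be sketched as below. The item matrix is hypothetical (five pupils, six items scored 1 or 0) and the function names are illustrative; only the formula itself, 2r/(1 + r) for a test of doubled length, is the standard Spearman-Brown prophecy formula.

```python
import math


def pearson(x, y):
    """Product-moment correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_brown(r_half):
    """Reliability of the full test predicted from the half-test correlation."""
    return 2.0 * r_half / (1.0 + r_half)


def odd_even_reliability(item_scores):
    """Correlate odd-item and even-item half scores, then step up to full length."""
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson(odd, even))


# Hypothetical right/wrong scores: one row per pupil, one column per item.
item_scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
]
reliability = odd_even_reliability(item_scores)
print(round(reliability, 3))
```

Read in reverse, the formula implies that a stepped-up reliability of .923 corresponds to a correlation of about .857 between the two half tests.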





        No. of             Standard     Validity     Standard                         Diff. /
Test    Pupils    Mean     Deviation    Coefficient  Error      Diff.    σ diff.     σ diff.

A        155      37.00     12.60         .695        .042
                                                                 .026     .035         .74
B        155      39.66     12.96         .721        .039


The standard error of the difference was 
computed by a formula involving the sam- 
pling intercorrelation between the two validity 
coefficients. 
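
The standard errors and the final ratio in the table can be checked with a few lines of arithmetic. This sketch assumes the classical large-sample standard error of a correlation, (1 − r²)/√N, which the source does not name; the standard error of the difference (.035) is taken from the table as given, because the formula involving the intercorrelation of the two coefficients is not stated.

```python
import math


def se_of_r(r, n):
    """Assumed large-sample standard error of a correlation: (1 - r^2) / sqrt(N)."""
    return (1.0 - r ** 2) / math.sqrt(n)


se_a = se_of_r(0.695, 155)  # Test A validity coefficient
se_b = se_of_r(0.721, 155)  # Test B validity coefficient

difference = 0.721 - 0.695
se_difference = 0.035       # taken from the table, not recomputed
ratio = difference / se_difference

print(round(se_a, 3), round(se_b, 3), round(ratio, 2))
```

Under this assumption the computed standard errors round to the tabled .042 and .039, and the ratio rounds to the tabled .74, well short of the level usually required for significance.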

A slight advantage in the validity coefficient 
appears in favor of the test which was con- 
structed from the results of the responses of 
pupils. However, it is not significant in that 
mere chance alone is sufficient to account for 
the difference. The average score on Test A 
is slightly lower than on Test B, but the dif- 
ference again is not a significant difference. 
It is probable that the two tests measure a 
similar word meaning ability, as the correla- 
tion between the two is .783. 


In general, the results of this experiment 
would lead one to conclude that an attempt 
to get valid incorrect responses upon a word 
recognition test by the laborious task involved 
in tabulating the responses from recall tests 
is hardly justifiable. The judgment of the 
test-maker in selecting alternate responses se- 
cures approximately the same results. How- 
ever, the general conclusion that a test maker 
can by an a priori process secure satisfactory
alternate responses is not warranted. To se- 
cure an adequate answer to this problem, ad- 
ditional experimentation in fields other than 
word meaning should be conducted. 















RELATIONSHIPS BETWEEN QUALIFYING EXAMINATIONS,
VARIOUS OTHER FACTORS, AND STUDENT TEACHING
PERFORMANCE AT THE UNIVERSITY OF MINNESOTA*


RUDYARD K. BENT
Assistant Professor of Education
University of Arkansas


THE PROBLEM 


Teacher training schools have for some 
time been attempting to find more reliable and 
valid techniques for the early selection of 
worthy prospective teachers and the elimina- 
tion from the teaching profession of those who 
are less promising. It is becoming more de- 
sirable for each institution to initiate a selec- 
tive admission program or to devise standards 
for limiting its enrollment. The need is evi- 
denced in the number of students who enter 
the teaching field—a number, in many states, 
larger than the demand. Selection is also 
needed in order to protect training school 
students from poorly prepared teachers. 

Perhaps it is neither desirable, nor feasible, 
to make the selection of prospective high 
school teachers at the time of entrance into the 
college, but to make it at the end of the junior 
year. By that time, many of the unpromis- 
ing will have transferred to other fields on 
their own initiative or will have left school. 
Even if some are eliminated at that late pe- 
riod, their three years of schooling are not 
wasted, so far as a liberal education is con- 
cerned, although they suffered a delay in en- 
tering other fields of specialization. Selection 
by elimination at a later period is not as jus- 
tifiable as an early evaluation of teaching 
merit and the direction into other channels of 
those who are denied further training. 

Regardless of the great need for selective 
techniques in teacher training institutions, 
there are available very few which possess 
any great validity. Many attempts have been 
made with various criteria to make valid pre- 
dictions of teaching performance, but none 
have had any marked degree of success. Al- 
though the problem seems to be quite baf- 
fling, due to its many complications, each new 
study in the field will make one or more of 
three contributions: discover new conditioning
factors, supply evidence concerning certain
factors formerly thought to be valid
which actually have little true or prophetic
value, or rescue one which had been discarded.


* Research Paper No. 461, Journal Series, University of
Arkansas.


Among the various factors which were con- 
sidered by the faculty of the College of Edu- 
cation at the University of Minnesota, were 
qualifying or comprehensive examinations. 
These have been employed because of their 
possibilities as bases for the selection of stu- 
dent teachers in the College of Education and 
to help students make adjustments as well as 
for other purposes. 


Under the direction and leadership of Dr. 
Harl R. Douglass, qualifying examinations 
were given to the juniors in the College of 
Education, beginning during the school year 
1931-32. The two principal reasons for giv- 
ing them were: (1) to make possible an ex- 
periment in seeking a reliable and valid basis 
for predicting teaching success as a step 
toward limiting the supply of teachers, and 
(2) to provide a means for assuring that those 
who do student teaching will be well prepared 
as far as academic and professional subject 
matter is concerned. Mere class attendance, 
marks, and credits, are not indubitable evi- 
dence of the candidate’s knowledge or ability. 
Many merely “serve time” for credits; others 
take so many years to secure a degree that the 
subject matter taken earlier in the scholastic 
period may be forgotten or out of date. Fur- 
ther, qualifying examinations would, perhaps, 
serve as diagnostic agents for guiding and di- 
recting the design or pattern of the prospec- 
tive teacher’s academic and professional train- 
ing, or for directing him into other channels. 

The examinations covered the fields of Eng- 
lish, education, and the major teaching sub- 
ject. They were given for the first time in 
the spring of 1932, and at the end of each 
quarter thereafter to all students who had 
completed the junior year. The successful 
completion of each examination was later 
made a requirement for entrance into student 
teaching. 




The chief purpose of this report is to evalu- 
ate the relative merits of the qualifying exam- 
inations and to attempt to show their rela- 
tion, with various other factors, to student 
teaching performance in the secondary and 
elementary school. The other factors were 
aptitude tests, academic marks, and hours of 
credit. 


CONCLUSIONS FROM RELATED STUDIES 


From many criteria that have been evaluated
with reference to the prediction of teaching
performance, the following generalizations
may be made:


1. No single factor contains all the elements 
common to teaching ability; they are found in 
many factors. 

2. No satisfactory measure has yet been 
found to have sufficient value in the early se- 
lection of teachers that it may be employed 
without fear of injustice to prospective teach- 
ers or pupils in training and public schools. 

3. The factors found to have the highest 
prophetic value in teaching as found by the 
correlation techniques, are marks in profes- 
sional and subject-matter courses, and intelli- 
gence test scores. 

4. There is a wide variation between
coefficients of correlation obtained by different
investigators employing the same variables.
These coefficients summarized by Yaukey and
Anderson¹ reveal that of 18 studies involving
intelligence and student teaching success, the
range was from —.03 to .54, the median being
.23. Of those showing the relationship
between scholarship and teaching success in the
field, of 40 studies, the range was from .01 to
.77, the median being .31.

5. There seems to be a closer relationship
between intelligence test scores and teaching
success in the field for secondary school
teachers than for elementary teachers. The
average coefficient for the former was .33, and
for the latter .02, which suggests that factors
found to be related to teaching success in the
elementary school do not apply equally well
in the secondary school.


THE SUBJECTS 


The subjects of this investigation were 597 
juniors in the College of Education, Univer- 
sity of Minnesota, during the school year 
1931-32, who were seniors and did student 


¹ James V. Yaukey and Paul L. Anderson. “A Review of
the Literature on the Factors Conditioning Teaching Success.”
Educational Administration and Supervision, 19: 511-520, Oct.
1935.


teaching during the year 1932-33; and 487 
juniors for the following year who did student 
teaching in 1933-34. The total number of 
cases for the two year period was 1084, repre- 
senting 30 major subject groups. They were 
preparing to teach in both elementary and sec- 
ondary schools. Two departmental groups 
only were preparing specifically for the ele- 
mentary school. 


DESCRIPTION AND SCOPE OF THE 
QUALIFYING EXAMINATIONS 


The examinations were divided into four 
units: 

(1) Professional subject matter 

(2) English composition and literature 

(3) The major teaching field on the high 
school level, Major A 

(4) The major teaching field on the college 
level, Major B 


The professional examination covered three 
fields: (a) educational psychology, (b) sec- 
ondary education, and (c) techniques of high 
school instruction, or similar fields for ele- 
mentary school teachers. 

The examination in English consisted of 
two parts, the general and the essay. The 
Columbia Research Bureau English Test was 
employed for the former, and the students’
ability in composition was estimated from es- 
says written under supervision. 

The major A examinations were designed to 
measure the degree to which the candidate had 
mastered the content of the major teaching 
field as it is commonly taught in secondary 
schools. They contained material from sec- 
ondary school textbooks that are in use at the 
University of Minnesota. 

The major B examinations covered a more 
complete field. They were designed to meas- 
ure the complete and thorough mastery of the 
teaching field over and above that taught in 
the secondary schools. 

Reliability. The reliabilities of the quali- 
fying examinations, computed by the odd ver- 
sus even technique, ranged from .49 to .97, 
but the majority fell between .80 and .90.
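
The odd versus even technique can be sketched as follows. This is a minimal illustration, not the author's computation: the item responses are hypothetical, and the Spearman-Brown step-up to full test length is an assumption, since the report does not say whether the coefficients were stepped up.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def odd_even_reliability(item_scores, step_up=True):
    """item_scores: one list of item scores (e.g. 0/1) per examinee."""
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson_r(odd, even)
    # Spearman-Brown step-up (assumed here, not stated in the source)
    return 2 * r_half / (1 + r_half) if step_up else r_half

# Hypothetical responses of five examinees to six items
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]
print(round(odd_even_reliability(responses), 2))
```

Passing `step_up=False` returns the raw correlation between the two half-tests.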

Other Data. Besides the scores on the 
qualifying examinations, the following data 
were secured for each subject for whom there 
was a record: 


1. Miller Analogies Test scores (a psychological
test)²

2. College Aptitude Test scores (a psychological
test)³







3. Minnesota Reading Test scores (vocabulary,
sentence, and paragraph comprehension)⁴

4. Rank in the senior class of the high
school

5. Honor point ratio in education, the major,
English, and in all subjects. The
basis for computing the ratios was:
A = 3, B = 2, C = 1, D = 0, and
F = −1.

6. Hours of credit in education, English,
the major teaching subject, and in all
subjects combined.
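
The honor point ratio in item 5 can be illustrated with a short sketch. Two points are assumptions rather than statements of the source: the last mark in the scale is read as a failing mark scored −1, and each mark is weighted by its credit hours. The transcript is hypothetical.

```python
# Honor points per credit hour; the F = -1 entry is an assumed reading
HONOR_POINTS = {"A": 3, "B": 2, "C": 1, "D": 0, "F": -1}

def honor_point_ratio(courses):
    """courses: list of (letter_mark, credit_hours) pairs."""
    total_hours = sum(hours for _, hours in courses)
    total_points = sum(HONOR_POINTS[mark] * hours for mark, hours in courses)
    return total_points / total_hours

# Hypothetical transcript
transcript = [("A", 3), ("B", 5), ("C", 3), ("F", 2)]
print(round(honor_point_ratio(transcript), 2))
```

A ratio restricted to one field (education, English, or the major) would use only the courses in that field.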


THE CRITERIA OF TEACHING SUCCESS 


The criteria employed in the evaluation of
the qualifying examination program were
judgments of student teaching performance
made by the critic teachers of the University
of Minnesota High School, and marks on student
teaching. For the year 1932-33 the
judgments were rank order lists and person- 
ality ratings, and for the following year the 
University High School Rating Scale was em- 
ployed. 

Toward the end of the spring quarter all 
critic teachers in each department made a 
composite ranking of every student teacher. 
Each departmental group of student teachers 
was ranked from one to the number in the 
department. This criterion has one weakness:
not all of the critic teachers who helped
make the composite saw all of the student
teachers teach. Thus while each critic might well be
able to place each one with reference to those 
he knew, the absence of an absolute stand- 
ard made it difficult to place those with whom 
he was not so well acquainted. 

On personality, each student was assigned 
to one of six categories, from A to F. Each 
teacher who observed the student teach made 
an independent rating. Instructions for rating
and a definition of teaching personality
were sent to each teacher.

The University High School Rating Scale 
made provision for rating teachers on each of 
ten items: personal grooming, personality, 
loyalty, vitality, knowledge of subject matter, 
organization of subject matter, skill in method, 
achievement of pupils, discipline, and poten- 
tiality. 

² Miller Analogies Test for Graduate Students, by W. S.
Miller, University of Minnesota, Minneapolis, Minnesota.

³ Minnesota College Aptitude Tests, by the Department of
Psychology, University of Minnesota. University of Minnesota
Press.

⁴ Minnesota Reading Test, by M. E. Haggerty and A. C.
Eurich. University of Minnesota Press.




REDUCTION OF DATA INTO WORKING UNITS 


In order to facilitate the work of tabulating 
the data and giving them statistical treatment, 
all scores on the qualifying examinations were 
reduced to percentile ranks. By means of ap- 
propriate tables these were converted into 
standard deviation scores for tabulation and 
for combining the examinations in order to 
get a composite score. Honor point ratios 
and ratings on the University High School 
Rating Scale did not receive this treatment. 
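
The reduction described above can be sketched as follows. The report used printed conversion tables; this sketch substitutes the inverse normal distribution, which is presumably what such tables tabulate. The unweighted mean used for the composite and the percentile ranks are assumptions made only for illustration.

```python
from statistics import NormalDist, mean

def percentile_to_sigma(percentile_rank):
    """Map a percentile rank (0 < p < 100) to a normal deviate (sigma score)."""
    return NormalDist().inv_cdf(percentile_rank / 100)

def composite(percentile_ranks):
    """Unweighted mean of the sigma scores (the weighting is an assumption)."""
    return mean(percentile_to_sigma(p) for p in percentile_ranks)

# Hypothetical percentile ranks of one student on the four examination units
print(round(composite([84, 50, 69, 31]), 2))
```

Averaging sigma scores rather than percentile ranks keeps the units roughly equal along the whole scale, which is the usual reason for the conversion.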


RELATIONSHIP WITH THE CRITERIA 
(1932-33 GROUP) 


For the purpose of studying the validity of 
the qualifying examinations, all variables were 
divided into two groups: conditioning factors 
and criteria of teaching success. The former 
consisted of the qualifying examinations, ap- 
titude tests, hours of credit, and honor point 
ratio. The criteria consisted of judgments of 
student teaching success by critic supervisors. 

The factors which correlated most highly 
with student teaching rank were hours of 
credit in all subjects, general honor point ra- 
tio, and honor point ratio in the major. The 
high correlation between hours of credit and 
student teaching rank was due, most prob- 
ably, to the fact that all data were collected at 
the end of the students’ third calendar year in 
the University, and “hours of credit” means 
the number of hours earned up to that time. 


TABLE I

COEFFICIENTS OF CORRELATION BETWEEN CONDITIONING
FACTORS AND JUDGMENTS OF STUDENT TEACHING
PERFORMANCE (1932-33 GROUP)

Rank  Conditioning Factor                                  r
 1.   Hours of credit in all subjects                     .46*
 2.   General honor point ratio                           .46
 3.   Honor point ratio in the major                      .45
 4.   Composite score (qualifying examinations)           [illegible]
 5.   Credits in education                                .30
 6.   Honor point ratio in English                        .29
 7.   College Aptitude Test scores                        .27
 8.   Honor point ratio in education                      .27
 9.   Total education scores (qualifying examinations)    .26
10.   Minnesota Reading Test scores                       .25
11.   Total English scores (qualifying examinations)      .22
12.   High School Rank                                    .21
13.   Miller Analogies Test scores                        .20

* The probable errors of these coefficients
were all low, ranging from .03 to .05.







A large number of hours reflected a heavier 
than average load and a small number (if 
any) of hours failed—both measures of schol- 
arship and indirectly of intelligence. The co- 
efficient of correlation between ratings on 
teaching success made by the same supervisors 
was .67 for the combined secondary school 
groups. The lack of complete correspondence 
between these two criteria may be attributed 
in part to the lack of reliability of the two 
sets of measures. 


The majority of the inter-correlations
among the conditioning factors were higher
than those between the conditioning factors
and the criterion. Most of these were between
the limits of .30 and .54, the median
being .43.

Major Subject Groups. The coefficients of
correlation between any conditioning factor
and student teaching rank varied considerably
among the departments. For example, the
coefficient between honor point ratio in the
major subject and student teaching rank was
.51 for the music group, while for the kindergarten
and nursery school group it was only
.04, and for physical education for men −.07.
The coefficients of correlation between the
composite qualifying examination scores and
the criterion varied from .17 to .44, the average
being .30; and with College Aptitude Test
scores the variation was from .06 to .46, the
average being .22 (Table II).
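
The range and average quoted for the composite scores can be checked against the composite-score column of Table II. The sketch below uses the eight coefficients and case counts as they are read from that table. The weighted mean is an addition for comparison only, since the report does not say whether its average was weighted.

```python
from statistics import mean

# Composite-score column and number of cases, as read from Table II
composite_r = [0.25, 0.17, 0.33, 0.32, 0.30, 0.44, 0.30, 0.30]
n_cases = [35, 64, 48, 60, 38, 48, 32, 30]

simple_mean = mean(composite_r)
# Mean weighted by departmental size (not a figure given in the report)
weighted_mean = sum(r * n for r, n in zip(composite_r, n_cases)) / sum(n_cases)

print(min(composite_r), max(composite_r))
print(round(simple_mean, 2), round(weighted_mean, 2))
```

Both averages round to .30, so the reported figure does not depend on the choice of weighting.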

These wide variations among the depart- 
mental groups make the problem of predic- 
tion more difficult. They reveal clearly that 
a general regression equation for all depart- 
ments combined has less value than studies 
made separately of each department. A conditioning
factor which would be of great value
in one department would not necessarily be of
equal importance in another.

The inter-correlations between the condi- 
tioning factors were in general higher than the 
coefficients of correlation with the criterion, 
though some were lower. 


RELATIONSHIP WITH THE CRITERIA 
(1933-34) 

The criterion selected for an evaluation of 
student teaching for the 1933-34 group gave 
decidedly lower correlations with conditioning 
factors for the entire secondary school than 
for the 1932—33 student teachers. The cri- 
terion for the 1933-34 group was the Uni- 
versity High School Rating Scale. 

For the entire secondary school group the
coefficients of correlation between the criterion
and English, Education, Miller Analogies,
College Aptitude Test, and honor point ratio
for all subjects ranged from .03 to .19 (Table
III). They were definitely lower than the
coefficients derived from the same variables
for the previous year, the range for the latter
being .20 to .46. The one variable correlating
equally closely in both groups was the Miller
Analogies Test scores, .20 for 1932-33, and
.19 for 1933-34.

As explanatory of this marked difference
the following may be suggested: The rank
order lists of student teachers made in 1932-33
were composite lists made in a group meeting
by all critics who came in contact with the
student teachers. If one teacher, through
bias, halo influence, or prejudice, attempted to
rate a student too high or too low, it was offset
by other critics. As contrasted with this
method, the criteria for 1933-34 consisted of
independent ratings (employing the University
High School Rating Scale) of each student
teacher by each critic teacher who observed
him teach. For those students who
were observed by more than one critic, the
ratings were combined by averaging.

TABLE II

COEFFICIENTS OF CORRELATION BETWEEN CONDITIONING FACTORS AND JUDGMENTS OF TEACHING
SUCCESS FOR VARIOUS DEPARTMENTAL GROUPS (1932-33)

                              Number    Composite Score   Honor Point   College Aptitude   Miller Analogies
Department                    of Cases  (Qualifying       Ratio         Test Scores        Test Scores
                                        Examinations)     (Major)

Commercial Education            35          .25             .32            .26                .50
[illegible]                     64          .17             .31            .29                .08
[illegible]                     48          .33             .50            .46                [illegible]
Home Economics                  60          .32             .50            .06                .02
Kindergarten and Nursery        38          .30             .04            .20                .14
Music                           48          .44             .51            [illegible]        .30
Physical Education (Men)        32          .30            −.07            .24                .10
Physical Education (Women)      30          .30             .41            [illegible]        .10


With but few exceptions the coefficients of 
correlation between the conditioning factors 
and the criterion for the major subjects groups 
were negligible, which was in marked contrast 
to the previous year. These erratic tenden- 
cies suggest that either departmental groups 
from one year to the next are not comparable, 
or little confidence can be placed in the 
ratings. 


TABLE III

COMPARISON BETWEEN COEFFICIENTS OF CORRELATION
WITH THE GENERAL CRITERIA FOR ALL SECONDARY
SCHOOL GROUPS FOR THE 1932-33 GROUP AND THE
1933-34 GROUP

                                       1932-33          1933-34
                                       Student          University High
Conditioning Factor                    Teaching Rank    School Rating Scale

Education (Qualifying Examination)       .26*             .05*
English (Qualifying Examination)         .22              .14
Miller Analogies Test scores             .20              .19
College Aptitude Test scores             .27              .03
Honor Point Ratio, all subjects          .46              .12

* The P. E.’s of these coefficients range from
.03 to .05.



CONCLUSIONS 


From a study of the relationships between 
the various conditioning factors and the cri- 
teria, the following conclusions may be de- 
rived: 

1. Coefficients of correlation between the 
conditioning factors and the criteria were not 
high enough to predict the performance of in- 
dividuals with any great improvement over 
chance. The highest predictors were honor 
point ratio and hours of credit, (r= .45 
and .46). 

2. The majority of the coefficients of cor- 
relation between the conditioning factors and 
the criteria were positive. In some depart- 
ments the relationships were close to zero. 

3. Of the two methods used as measures of 
student teaching performance, rank order, and 
the University High School Rating Scale, the 
rank order lists, made by all critic teachers 
who knew the student teachers, had a higher 
degree of relationship with the conditioning 
factors. 

4. Qualifying examinations seem to be less 
valuable than honor point ratio in predicting 
teaching performance, and add very little to 
accuracy in predicting teaching success from 
honor point ratio alone. 

5. The coefficients of correlation for all 
large departmental groups between the major 
A examinations and student teaching rank 
were less than those of the major B examina- 
tions, the average of the former being .19 and 
of the latter .25. 













THE RELATIVE PREDICTIVE VALUE OF CERTAIN 
COLLEGE ENTRANCE CRITERIA* 


HERBERT A. LANDRY 


The Problem. This study has attempted
to establish the relative value of four college 
entrance criteria for the prediction of both av- 
erage freshman scholarship and scholarship in 
specific subject-matter fields; i.e., English, 
languages, social science, natural science, and 
mathematics. The entrance criteria or ele- 
ments used in the study included (1) secon- 
dary-school, grade 12, final subject marks, (2) 
the marks on the examinations of the College 
Entrance Examination Board, (3) the scores 
on the tests of the Cooperative Test Service,
and (4) the scores on tests of scholastic apti- 
tude (verbal and mathematical) of the College 
Entrance Examination Board. 

Data Used. The data used were obtained 
from the records of 416 boys who were among 
the June 1932 graduates of sixteen inde- 
pendent (private) secondary schools of east- 
ern United States. These boys entered the 
freshman class of three eastern colleges the 
following September. In addition to the sec- 
ondary-school marks, examination marks, and 
the test scores previously mentioned, there 
were available the final marks obtained in the 
various college freshman subjects. These 
data were provided by the Educational Rec- 
ords Bureau of New York City. 

Treatment of Data. A preliminary survey 
indicated the existence of differences in the 
meaning of the marks given by the various 
secondary schools. Adjustment procedures 
were applied for the purpose of equating these 
marks, since such variations would tend to 
confuse relationships and result in the restric- 
tion of predictions on the basis of real ability. 
This adjustment was made by projecting the 
mean marks of grade 12 in each school upon 
a mental-age scale derived from group intelli- 
gence tests administered to the students dur- 
ing the senior year. Each mark-group in each 
school thereafter was represented by the me- 
dian of the mental ages of those who received 
marks that fell within that group. 
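
The final step of this adjustment, in which each mark-group in each school is represented by the median mental age of the students who received it, can be sketched as follows. The records are hypothetical, and the earlier step of projecting school means onto the mental-age scale is not reproduced because the abstract gives no detail on it.

```python
from collections import defaultdict
from statistics import median

def adjusted_marks(records):
    """records: (school, mark_group, mental_age) triples, one per student.

    Returns {(school, mark_group): median mental age of that group}.
    """
    groups = defaultdict(list)
    for school, mark_group, mental_age in records:
        groups[(school, mark_group)].append(mental_age)
    return {key: median(ages) for key, ages in groups.items()}

# Hypothetical records: the same letter mark carries different meaning in two schools
records = [
    ("School 1", "B", 17.0), ("School 1", "B", 17.6), ("School 1", "B", 18.2),
    ("School 2", "B", 18.4), ("School 2", "B", 19.0),
]
scale = adjusted_marks(records)
print(scale[("School 1", "B")], scale[("School 2", "B")])
```

Under this scheme a mark of B from the second school enters the correlations with a higher adjusted value than a B from the first.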

* Abstract of a thesis submitted in partial fulfillment of the
requirements for the degree of Doctor of Philosophy in the
School of Education of New York University, 1936.

Conversion of the college marks as reported
to a uniform and comparable scale was necessary
because of the differences which existed
between the marking systems used in the three
colleges. No adjustment of marks for intellectual
selection was made, since investigation
showed that there was no apparent difference
in the quality of student material in the freshman
classes of each of the colleges. Comparability
of marks was established by representing
each mark-group by the mean sigma value¹
of that group, assuming a normal distribution
of data.
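
One way to obtain a mean sigma value for each mark-group is sketched below. It assumes the marks slice a standard normal distribution into consecutive segments whose areas equal the proportions of students receiving each mark. The proportions and mark letters are hypothetical, and the function name is illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def mean_sigma_values(proportions):
    """proportions: share of students in each mark group, lowest mark first.

    Returns the mean normal deviate of each group.
    """
    values, cum = [], 0.0
    for p in proportions:
        lo, hi = cum, min(cum + p, 1.0)
        # Ordinate of the normal curve at each cut; it is zero in the open tails
        pdf_lo = STD_NORMAL.pdf(STD_NORMAL.inv_cdf(lo)) if lo > 0 else 0.0
        pdf_hi = STD_NORMAL.pdf(STD_NORMAL.inv_cdf(hi)) if hi < 1 else 0.0
        # Mean of a normal segment = difference of ordinates / area
        values.append((pdf_lo - pdf_hi) / p)
        cum = hi
    return values

# Hypothetical shares of marks E, D, C, B, A in one college
shares = [0.05, 0.15, 0.45, 0.25, 0.10]
for mark, z in zip("EDCBA", mean_sigma_values(shares)):
    print(mark, round(z, 2))
```

Because the values depend only on the proportions, two colleges with different letter systems end up on the same scale.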

The noncomparable scores on the tests of
the Cooperative Test Service were converted
into a comparable series by means of the procedure
suggested by Hull.² This involved the
conversion of the raw scores of each test
series into a new series of T scores, in which
the score value of the mean was arbitrarily
set at 50 score points with a standard deviation
of 10. The scores on the Scholastic Aptitude
Test and the marks on the subject-matter
examinations of the College Entrance Examination
Board were used as reported.
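
A conversion of this kind can be sketched as a linear rescaling of each test series. Whether Hull's procedure is linear or also normalizes the distribution is not stated in the abstract, so the linear form is an assumption. The raw scores are hypothetical.

```python
from statistics import mean, pstdev

def t_scores(raw_scores):
    """Linear rescaling of one test series to mean 50, standard deviation 10."""
    m, s = mean(raw_scores), pstdev(raw_scores)
    return [50 + 10 * (x - m) / s for x in raw_scores]

raw = [12, 15, 18, 21, 24]  # hypothetical raw scores on one test
print([round(t, 1) for t in t_scores(raw)])
```

Applying the same rescaling to every test series gives each the same mean and spread, so scores from different tests can be averaged.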


Method of Analysis of Data. The relative 
predictive value of the various entrance ele- 
ments for freshman general success and for 
success in specific subject-matter fields was de- 
termined by comparing the Pearson Product- 
Moment coefficient of correlation obtained be- 
tween these elements and freshman marks. 
Correlation tables were made first with data 
from each college separately and then with the 
combined data, so that analyses of the distri- 
butions of data might be made. Preliminary 
to the use of the correlation technic, data used 
were tested for linearity by the use of Blake- 
man’s Test of Linearity. This test indicated 
that while these data were not all strictly 
linear, the coefficients of correlation would 
provide an adequate measure of relationship. 
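
The correlation step, computed first for each college and then for the combined data, can be sketched with the raw-sum form of the product-moment formula. All scores below are hypothetical, and Blakeman's linearity test is not reproduced.

```python
from math import sqrt

def pearson_r(pairs):
    """Product-moment r from raw sums over (x, y) pairs."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    syy = sum(y * y for _, y in pairs)
    sxy = sum(x * y for x, y in pairs)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

# Hypothetical (entrance score, freshman mark) pairs for three colleges
colleges = {
    "College A": [(52, 71), (60, 78), (47, 66), (65, 80), (55, 70)],
    "College B": [(58, 74), (44, 69), (63, 83), (50, 68), (69, 85)],
    "College C": [(41, 62), (56, 75), (62, 72), (48, 70), (59, 79)],
}

for name, pairs in colleges.items():
    print(name, round(pearson_r(pairs), 3))

pooled = [pair for pairs in colleges.values() for pair in pairs]
print("Combined", round(pearson_r(pooled), 3))
```

Comparing the separate and combined coefficients shows whether pooling colleges with different score levels changes the apparent relationship.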


¹ K. J. Holzinger, Statistical Methods for Students in Education.
Boston: Ginn and Company, 1928, pp. 221-224.

² C. L. Hull, Aptitude Testing. New York: World Book
Company, 1928, pp. 396-400.


The Findings. Efforts were first directed
toward the determination of the relationships
existing between the means of the final freshman
marks and various elements used for college
entrance. The coefficients of correlation
which were computed to indicate these relationships
are shown in Table I. It will be
seen that the most marked relationship (r =
.625) was found with the mean of the adjusted
secondary-school (grade 12) marks.
It will also be noted that the same marks,
when they were not adjusted for differences
in the marking systems and the factor of intellectual
selection, provided a correlation of
.562. This value was practically the same as
that obtained with the mean of the C.T.S. test
scores, .573. In cases where adjustment data
were not available, this study showed that
either element, the mean of the secondary-school
unadjusted marks or the mean of the
C.T.S. scores, could be used with the same
degree of effectiveness for predictions of general
freshman scholarship. The V.A.T. scores
and the mean of the C.E.E.B. examinations
were found to be less effective as devices for
the prediction of average freshman scholarship.


TABLE I

COEFFICIENTS OF CORRELATION (ZERO ORDER)
OBTAINED BETWEEN PREDICTIVE ELEMENTS
AND FRESHMAN SCHOLARSHIP, USING
THE COMBINED DATA OF
THREE COLLEGES*

                                         r       P.E.    N
Mean of Freshman Marks and:
  Sec.-school unadjusted mean marks    .562     .023    416
  Sec.-school adjusted mean marks      .625     .020    416
  V.A.T. scores                        .460     .028    364
  C.E.E.B. examination mean marks      .491     .027    358
  C.T.S. test mean scores              .573     .023    403

English Marks and:
  Sec.-school adjusted marks           .475     .028    340
  V.A.T. scores                        .5138    .027    339
  C.E.E.B. examination marks           .345     .033    309
  C.T.S. test scores                   .475     .028    364

Language Marks and:
  Sec.-school adjusted marks           .496     .028    330
  V.A.T. scores                        .401     .029    347
  C.E.E.B. examination marks           .395     .030    354
  C.T.S. test scores                   .483     .025    392

Social Science Marks and:
  Sec.-school adjusted marks           .403     .046    152
  V.A.T. scores                        .397     .034    273
  C.E.E.B. examination marks           .431     .040    182
  C.T.S. test scores                   .478     .037    200

Natural Science Marks and:
  Sec.-school adjusted marks           .421     .037    222
  V.A.T. scores                        .248     .037    293
  M.A.T. scores                        .182     .039    286
  C.E.E.B. examination marks           .260     .042    224
  C.T.S. test scores                   .432     .040    187

Mathematics Marks and:
  Sec.-school adjusted marks           .516     .043    132
  M.A.T. scores                        .491     .036    204
  C.E.E.B. examination marks           .426     .039    197
  C.T.S. test scores                   .478     .039    181

* In this table and elsewhere:
V.A.T. = Verbal Aptitude Test of the College
Entrance Examination Board.
M.A.T. = Mathematical Aptitude Test of the
College Entrance Examination Board.
C.E.E.B. = College Entrance Examination
Board.
C.T.S. = Cooperative Test Service.

All, or nearly all, of the examinations and
tests of the C.E.E.B. and the C.T.S. in the
subjects of instruction covered by this study
were used (39 different examinations of the
C.E.E.B. and 20 tests of the C.T.S.).
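
The P.E. column of Table I appears consistent with the classical probable error of a correlation coefficient, 0.6745(1 − r²)/√N. The abstract does not state the formula, so its use here is an assumption. The sketch below recomputes the five values for mean freshman marks from r and N as read from the table.

```python
from math import sqrt

def probable_error(r, n):
    """Classical probable error of a Pearson r: 0.6745 * (1 - r^2) / sqrt(n)."""
    return 0.6745 * (1 - r * r) / sqrt(n)

# (label, r, N) as read from Table I for mean freshman marks
rows = [
    ("Sec.-school unadjusted mean marks", 0.562, 416),
    ("Sec.-school adjusted mean marks", 0.625, 416),
    ("V.A.T. scores", 0.460, 364),
    ("C.E.E.B. examination mean marks", 0.491, 358),
    ("C.T.S. test mean scores", 0.573, 403),
]
for label, r, n in rows:
    print(f"{label}: P.E. = {probable_error(r, n):.3f}")
```

The recomputed values round to the tabled .023, .020, .028, .027, and .023, which supports the reading of these rows.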


An examination of the coefficients of cor- 
relation found between the predictive elements 
and freshman marks in specific subject-matter 
fields will indicate that differential predictions 
were more restricted than were predictions 
for general scholarship. It will also be noted 
that considerable variation existed in the de- 
gree to which scholarship in the different types 
of subject matter was predicted. No one pre- 
dictive element was distinctive in that it pro- 
vided consistently higher correlations. How- 
ever, one element, the secondary-school ad- 
justed marks, provided the highest correla- 
tions in two of five subjects, while another ele- 
ment, the C.E.E.B. examinations, provided the 
lowest correlations in three of five subject-matter
fields. The C.T.S. tests provided the
most consistent results, since in four subject-matter
fields the variation in the size of the
coefficients of correlation was less than .010.
The fifth differed by only .050. The scores
on the V.A.T. provided the highest correlation
in the series for English.


The secondary-school adjusted marks on 
English and the Cooperative Test Service English
Test had practically the same predictive
value, while the examinations of the C.E.E.B. 
provided the poorest prediction of the group 
of four. 


In the prediction of the languages, the sec- 
ondary-school adjusted marks in the languages 
provided the highest correlation, .496, which 
was practically the same as the coefficient of 
.483 obtained for the C.T.S. tests. The relationships
found for the C.E.E.B. examination
marks in language and for the V.A.T. scores were
similar in degree and somewhat more limited
than the former two.

The scores on the C.T.S. tests in social 
science provided the highest correlation, .478, 
with freshman final marks in this subject- 
matter field. The secondary-school adjusted 
marks and the scores on the V.A.T. provided 
correlations that were very similar in size, 
.403 and .397, respectively. These were, how- 
ever, lower than those obtained with the 
C.T.S. test scores and the C.E.E.B. examina- 
tion marks. The correlation obtained with 
the secondary-school marks was the lowest ob- 
tained for this element in the differential pre- 
diction series. 

The relationships found for natural science 
were as a whole more limited than those found 
for any other subject. The correlations obtained
with three of the elements, i.e., the
V.A.T. and C.T.S. scores and the C.E.E.B.
marks, were the lowest for each of these elements
in the differential prediction series.

There was little difference in the predictive 
value of both the secondary-school adjusted 
marks and scores on the C.T.S. tests in the 
natural sciences. These two entrance elements 
provided the highest correlations for this sub- 
ject. They were .421 and .432, respectively. 
Similarly, there was little difference in the 
predictive value of the V.A.T. scores and 
C.E.E.B. examination marks for this subject. 
However, their value was much more limited 
than the former two, since the coefficients
obtained with freshman final marks were .248
and .260, respectively. The M.A.T. provided 
the lowest correlation, .182. The relation- 
ships found in the case of the last three en- 
trance elements were so limited that they 
would have little if any practical value in the 
prediction of freshman final marks in the nat- 
ural sciences. 

The correlations obtained in mathematics 
were higher as a group than were those for 
any other subject studied. The highest corre- 
lation, .516, in this series was found between 
the secondary-school adjusted marks in mathe- 
matics and freshman final marks in the same 
subject-matter field. The scores on the
M.A.T. and C.T.S. tests were only slightly 
less effective, the correlations with these ele- 
ments being .491 and .478, respectively. The 
C.E.E.B. examinations provided the lowest 
correlation, .426. These findings indicate that 
the aptitude test again has provided a higher 
relationship with freshman marks in a particular
subject than has one of the content
examinations, the C.E.E.B. examination. They
also indicate that for practical purposes
of prediction, the secondary-school adjusted
marks and the scores on the C.T.S. tests and
M.A.T. have relatively the same value.


Intercorrelation of Entrance Elements. The 
intercorrelations obtained between the en- 
trance elements were found to be generally 
higher than the correlation obtained with the 
freshman final marks. These are shown in 
Table II. 

The highest intercorrelation was found be- 
tween the mean of the C.T.S. test scores and 
the V.A.T. scores. This correlation, .709, pro- 
vides further evidence of the marked commu- 
nity of function existing between tests of scho- 
lastic aptitude and achievement. Since the 
group here used was a selected sample of the 
total on whom the test norms were based, the 
value .709 should be corrected to the range of 
the larger group. Doing this and correcting 
for attenuation in addition, it becomes .910.
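
The two corrections can be sketched with their standard textbook forms: the correction for restricted range, r·k / √(1 − r² + r²k²), and Spearman's correction for attenuation, r / √(r_xx · r_yy). The abstract reports neither the ratio of standard deviations nor the reliabilities, so the inputs below are hypothetical and the sketch does not attempt to reproduce the corrected figure.

```python
from math import sqrt

def correct_for_range(r, sd_ratio):
    """sd_ratio: SD in the wider group divided by SD in the selected sample."""
    return r * sd_ratio / sqrt(1 - r * r + r * r * sd_ratio * sd_ratio)

def correct_for_attenuation(r, rel_x, rel_y):
    """Spearman's correction: divide by the geometric mean of the reliabilities."""
    return r / sqrt(rel_x * rel_y)

r_observed = 0.709
# Hypothetical inputs, chosen only for illustration
r_range = correct_for_range(r_observed, sd_ratio=1.3)
r_both = correct_for_attenuation(r_range, rel_x=0.92, rel_y=0.90)
print(round(r_range, 3), round(r_both, 3))
```

Both corrections raise the coefficient whenever the selected sample is less variable than the norm group and the tests are less than perfectly reliable.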

The C.E.E.B. examinations and C.T.S. tests
correlated higher with the secondary-school
marks than with freshman marks. These correlations
indicated that they were better measures
of secondary-school achievement than
they were predictors of freshman marks.
Relatively little difference was found, however,
in the case of the V.A.T. scores.


TABLE II

INTERCORRELATIONS BASED UPON COMBINED
DATA OF THE THREE COLLEGES

                             Fresh.    Sec.-Sch.              Ave. of
                             Final     Adjusted    V.A.T.     C.E.E.B.
                             Ave.      Ave.        Scores     Marks

Sec.-sch. adj. ave.   r      .625
                      P.E.   .020
                      N      416

V.A.T. scores         r      .460      .497
                      P.E.   .028      .026
                      N      364       364

Ave. of C.E.E.B.      r      .491      .658        .529
marks                 P.E.   .027      .021        .033
                      N      358       358         320

Ave. of C.T.S.        r      .573      .665        .709       .658
scores                P.E.   .023      .019        .018       .021
                      N      403       403         342        319


Variations in Prediction Found Among the 
Three Colleges. A study of the correlations 
obtained for the separate colleges brings out 
interesting variations in the predictive value of 
the different elements in each of the three col- 
leges. 













More marked variations were found with 
the secondary-school final marks and the ex- 
aminations of the C.E.E.B. than with the 
scores on the V.A.T. and C.T.S. 

The marks on the C.E.E.B. examinations
provided the greatest variations of prediction
among the three colleges in four out of five
subject fields. On the other hand, the scores
on the C.T.S. tests provided the greatest uniformity
of prediction in four out of five subjects.

General Conclusions. 1. Differential predictions
of freshman scholarship are more restricted
than general predictions.

2. The best entrance element for general
prediction is the mean of the secondary-school,
grade 12, adjusted final marks. The other
elements ranked in the order of their predictive
value were the mean of the C.T.S. test
scores, the V.A.T. scores, and the mean of
the C.E.E.B. examination marks.


3. The secondary-school adjusted marks 
were found to be the best entrance element 
for differential predictions, with the C.T.S. 
test scores ranking a very close second. The 
V.A.T. scores are next in value, with the 
marks on the C.E.E.B. examinations providing
the poorest prediction.

4. Considerable variation exists in the pre- 
dictive value of the different entrance elements 
in different subject-matter fields and in the 
different colleges. 


Discussion and Recommendations. The 
coefficients found between the various predic- 
tive elements and the criterion do not repre- 
sent the “true” relationships that exist. The 
observed relationships are subject to various 
limitations, which have served to reduce the 
obtained coefficients from the “true” relation- 
ships of the variables studied. The most sig- 
nificant of these factors were (1) the relative 
homogeneity of the groups studied, (2) the 
lack of perfect reliabilities of the predictive 
elements, (3) the limitations of the criterion, 
and (4) the maladjustments of students to 
college. 

Within these limits, this investigation has 
added further evidence to the fact that none 
of the college entrance elements studied can, 
taken singly, predict accurately an individual’s 
general scholastic success in his freshman college
year. The limitations in predicting
scholarship in specific subject-matter fields
are even more marked. Increases in predictive
efficiency have been reported by various investigators
as having been brought about by
combining the various entrance elements in 
predictive formulas. Even when this is done, 
the forecasting efficiencies of the best combi- 
nations used rarely exceed 35 per cent. It is 
evident, therefore, that there is yet great need 
for improvement of both admission technics 
and the measurement of college achievement. 
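
The forecasting efficiency referred to above is presumably the index E = 1 − √(1 − r²), the proportional reduction in the standard error of estimate. The abstract does not define the term, so this reading is an assumption. The sketch shows what the index gives for the best single element in Table I and what correlation a 35 per cent efficiency would require.

```python
from math import sqrt

def forecasting_efficiency(r):
    """Proportional reduction in the standard error of estimate: 1 - sqrt(1 - r^2)."""
    return 1 - sqrt(1 - r * r)

def r_needed(efficiency):
    """Correlation required to reach a given forecasting efficiency."""
    return sqrt(1 - (1 - efficiency) ** 2)

print(round(forecasting_efficiency(0.625), 3))  # best single element in Table I
print(round(r_needed(0.35), 2))                 # correlation implied by 35 per cent
```

On this index a correlation in the .60s removes only about a fifth of the error of a blind guess.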


The numerous recent investigations of the 
value of certain personal traits for college suc- 
cess are indicative of the realization that the 
present entrance elements do not provide as 
complete a description of the individual as is 


to be desired. There is need for the development
of new, and the improvement of existing,
measures of personality factors found to
be related to college success. 


The cumulative guidance record, with its 
information concerning many aspects of the 
individual, as well as showing trends of 
growth and development, can, it is believed, 
go far in filling the gap that exists in the 
information necessary for more accurate pre- 
diction. In providing such a background, it 
may well serve to supplement statistical pre- 
dictions made with the more commonly used 
entrance elements and therefore result in more 
accurate forecasting of freshman college 
achievement. 


Source Material. The references listed be- 
low have been selected from among the re- 
ports of the more significant investigations 
that have been carried on in the field of the 
prediction of college achievement. 


Reports of the Chairman of the Commission on Scholastic Aptitude Tests.
New York: College Entrance Examination Board, 1927-1934.

Crawford, A. B., and Burnham, P. S., “Entrance Examinations and College
Achievement,” School and Society, XXXVI (September 10 and 17, 1932),
pp. 344-352, 378-384.

Douglass, H. R., and others, Relation of High School Preparation and
Certain Other Factors to Academic Success at the University of Oregon,
University of Oregon Publication, III, 1, 1931, 61 pages.

Edgerton, H. A., and Toops, H. A., Academic Progress, A Four-Year
Follow-Up Study of Freshmen Entering the University in 1923. Columbus,
Ohio: Ohio State University, 1929, 150 pages.

Jones, E. S. (editor), “Studies in Articulation,” University of Buffalo
Studies, IX, 1934, xiii + 319 pages.

Kornhauser, A. W., “Tests and High School Records as Indicators of
Freshman Success in an Undergraduate School of Business,” Journal of
Educational Research, XVI (December 1927), pp. 342-356.

Odell, C. W., “Predicting the Scholastic Success of College Students,”
University of Illinois Bulletin, XXVIII, 5, 1930, p. 43.

Reeves, F. W., and Russell, J. D., Admission and Retention of University
Students. Chicago, Ill.: University of Chicago Press, 1933, 360 pages.

Segel, D., “Prediction of Success in Junior College,” Junior College
Journal, I (May 1931), pp. 499-502.

Stoddard, G. D., “Iowa Placement Examination,” University of Iowa Studies
in Education, III, 2, 1925, p. 103.

Whitman, A. D., The Value of the Examinations of the College Entrance
Examination Board as Predictions of Success in College. New York: Bureau
of Publications, Teachers College, Columbia University, 1926,
viii + 77 pages.



DO MARKING SYSTEMS BASED UPON THE NORMAL PROBABILITY CURVE INSURE AN
EQUITABLE DISTRIBUTION OF MARKS IN ELECTIVE CURRICULA?¹


KARL C. PRATT


Central State Teachers College 
Mt. Pleasant, Mich. 


and 


VIRGIL WISE


Public Schools 
Glencoe, Ill. 


INTRODUCTION 


The history of marking systems tends to 
conform to the doctrine of the cyclical or 
spiral progression of human ideas and theories. 
Prior to the critical studies of Cattell, Meyer, 
Dearborn and others during the first decade 
of the century the prevailing marking system 
was the percentage system with a fixed point 
on the scale serving as the crucial or “passing” 
mark. The percentage of pupils allowed to 
fall below this point of stress was determined 
largely by personal or social exigencies, as was 
the case with respect to the distribution of 
individuals about the other less well defined 
nodes. Such a ‘system’ has generally been 
termed the ‘absolute’ marking system. An 
obvious lack of uniformity is inherent in such 
a procedure, since the marks will be influ- 
enced by whims, prejudices, and lack of pro- 
fessional training in formulating course objec- 
tives and in securing effective sampling of sub- 
ject matter. In short, a mark which was sup- 
posed to represent attainment in a given field 
was largely determined subjectively, aided by 
a certain skill in juggling. Thus did an 
“evaluation” of personality enter into stu- 
dents’ marks in those days. 


A system of this type could not long endure 
when confronted by the logic of those who 
pioneered the “relative” system of marking. 
One of the first institutions to inaugurate this 
was the University of Missouri from whence 
is derived the common reference to “mark- 
ing according to the curve” as the “Missouri 


¹ The major points of this article were advanced by the senior
writer in a paper presented in the session on Psychometrics of
the Midwestern Psychological Association in April 1936 at 
Northwestern University. The writers are indebted to Dr. 
H. ae and to Dr. T. C. Schneirla for many helpful 
suggestions. 


system.” In essence such systems are based upon the assumption that
academic achievement, in common with many other human traits, tends to
distribute itself in the proportions or symmetrical form of the normal
probability curve. Those ordinates of the curve which serve to set off the
letter or grade categories vary somewhat in different schools. However,
the relative distribution is fundamentally the same in 5-category systems.
In this manner a gross uniformity in the proportion of the different marks
is established and limits or bounds are set to the judgment of the
instructor.

The fact that the measuring devices or criteria available to the teacher
for evaluating achievement were all too frequently inadequate, invalid and
unreliable led to the construction and standardization of objective tests.
However, the use of objective tests in determining promotions,
scholarships, honors and other rewards has been quite limited, a
circumstance usually explained by the euphemism that they are “too
inflexible.” Some have urged a dual bookkeeping system, involving one set
of marks which is based on a relative or absolute basis for home
consumption, with actual achievement to be determined by standardized
tests for school records, for transfer purposes, and the like.

Students and many teachers have reacted adversely to the artificiality of
academic marks as sources of motivation, on the ground that they are
purely extrinsic and probably also because of an urge to be rid of the
unpleasant spur of competition to effort. Many who take this position have
advocated a return to a crucial, fixed nodal point which will separate the
“satisfactory” from the “unsatisfactory” without invidious comparisons of
differential performance in the first category. Singularly enough, in the
experience of the writers, most students who advocate the elimination of
competition in academic life are quite unenthusiastic about the
elimination of competition from their future economic life as teachers.
This is indeed a curious inconsistency.

At present there is a tendency in certain
educational circles to minimize any comparative
measurement of knowledge and to stress
the measurement of growth of the individual 
with reference to his own earlier record, or 
preferably, to place greatest stress upon what 
are usually termed “intangibles” (these by 
virtue of their intangibility being considered 
to have greatest importance). Thus we find 
measurement (sic) of personality to be one of 
the current educational fads. In this process 
we witness the turn of the cycle. Perhaps 
we should call this “turn of the spiral’’ instead, 
since we note that now in place of a somewhat 
unrestrained subjectivity in marking there is 
created a spurious certainty through the use 
of such devices as personality inventories, 
questionnaires, scales, tests of interests, be- 
havior journals or anecdotal records, etc. 

In regard to the criticisms which have been advanced against marking
systems which are based upon the normal probability curve, Finkelstein (5)
raised the first major objection to the postulate of a “normal”
distribution of achievement on a college level. It was his
contention that achievement, as judged from 
an analysis of 20,348 academic course marks 
made by students at Cornell University, was 
skewed negatively and that the 5-category 
system from lowest to highest should be in 
the proportions 12-19-45-21-3 rather than 
Meyer’s 3-22-50-22-3 proportions. It is to be 
observed that Cornell was operating upon a 
fixed passing mark basis at that time and that 
consequently Finkelstein’s distribution of 
marks was necessarily influenced by the sys- 
tem in vogue. 

In this investigation we are not concerned 
with the determination of the “proper” pro- 
portions of letter grades within a system which 
is based upon the normal probability curve, 
concerning which there is continuing con- 
troversy as shown by the recent articles of 
Eells (3, 4) and Davis (2). Nor can we see 
that the Oberlin system of marking by per- 
centile ranks, as reported by Meyer (6), is 
any more workable if it can be demonstrated
that there is no valid reason for assuming that
achievement distributes itself normally 
throughout the courses of a college system. 


An early and a continuing criticism of the 
so-called “curve” systems is that they may not 
be valid in comparatively small groups and 
that they become ridiculously unworkable in 
very small classes. Standard texts which of- 
fer the teacher advice on this problem offer 
consolation by suggesting that the teacher 
build up cumulatively a larger group and then, 
in time, judge a given class by reference to 
the larger group. The feasibility of this prac- 
tice depends, of course, upon the virtual stand- 
ardization and the long time use of test ma- 
terials. 


Even the latest and very comprehensive monograph in this field, by
Bohan (1), gives little explicit recognition or emphasis to the fact that,
where rather constant selective factors produce significantly different
distributions of intelligence among those electing the several courses,
the Missouri marking system, or any variant thereof, cannot produce marks
having the same value or significance, no matter how faithfully the system
is adhered to. Differential selection of students must make any fixed
marking system unworkable.


THE PROBLEM 


In this investigation the authors have endeavored to examine the following
possibilities:

1. The possibility that selective factors (e.g., subject matter, or
matters of a personnel, economic, or curricular nature) rather than chance
operate to provide a differential election of courses, so that the
students of given departments do not constitute a representative sampling
of the intelligence of the freshman class.

2. The possibility that because of this differential selection the
academic grading within the given department may be grossly unfair, either
despite or because of the fact that the grades are distributed according
to a “curve” system. It is also possible that such unfairness will be
partially revealed by a discrepancy between marks obtained within the
department concerned and those obtained in other departments by the same
students.²

² By marks ‘within a department’ we refer only to election of courses in
the department considered. The phrase does not carry the connotation of
“ ” or of “specialization.” Similarly “other” or “extra” marks refers to
marks made in their other work by these same students.




3. The possibility that this discrepancy be- 
tween the selection of intelligence and the dis- 
tribution of marks in selected departments 
may be revealed by comparing the relative 
proportions of intelligence and marks in terms 
of a 5-category system. 


PROCEDURES 


The population which was studied in this 
survey comprised 1550 students in six con- 
secutive annual freshman classes of a teach- 
ers college from 1929 to 1934 inclusive. 

The raw scores obtained by administering 
the American Council Psychological Exam- 
ination were converted into equivalent 1931 
scores by means of the national percentiles. 
Distributions of these scores were made ac- 
cording to individual representation rather 
than according to the number of courses 
elected within the department during the 
freshman year. This step appeared neces- 
sary because the number of courses in depart- 
mental sequence during the freshman year 
varies from department to department and, 
to a certain extent, from one portion of the 
time span of our study to another. Also we 
are concerned with the total distribution for 
the entire freshman year and not with intra- 
annum changes in the departmental selections. 
Representation of intelligence according to the 
total number of courses elected would tend ac- 
cording to circumstances to increase, to de- 
crease or to obscure the initial selective differ- 
ences. This would follow from the fact that 
departments which selected students of high 
ability would tend to eliminate the less able 
individuals, and in this way would soon leave 
a much more select group at the end of the 
year. On the other hand, departments which 
selected students of lower ability would tend 
to retain not only a greater proportion of their 
own initially inferior selections but also to re- 
ceive in the second or third term students who 
have been eliminated or discouraged from con- 
tinuing in the more selective departments. 

The academic marking system employed in 
the institution studied is the 5-category type 
with the A, B, C, D & E marks correspond- 
ing to the following proportions: 5-10; 20-30; 
40-50; 20-10; 10-5. This provides either 
the Mendenhall symmetry or a negative skew- 
ness, although little attempt has been made to 
enforce even these generous limits. Inasmuch 
as these provisions outlined above are sup- 
posed to apply to departmental elections as a 
whole (including both lower and higher level courses), there is a tendency
to make allowance for the continued selection of upper classmen or of
majors, first, by awarding a greater proportion of high marks to upper
classmen, and second, by distributing the freshman marks in a symmetrical
or even a positively skewed distribution.

In this study, the common point system ranging from 3 to 1 (A to C), with
0-value for both D and E, is discarded in favor of a 5-point rating, i.e.,
A = 5 to E = 1. The individual’s scholastic work within a given department
is obtained by averaging his course marks in that department. This value will
then represent him individually since we are 
attempting to study the academic achievement 
as individually segregated within a depart- 
ment rather than the average of all marks 
(disregarding the individuals making them) 
within a department. This practice is neces- 
sitated by the fact that a variability in the 
number of courses elected within a department 
will reflect somewhat the success of the stu- 
dent in the subject. Thus to average all aca- 
demic marks for the freshman year within a 
given department would tend to give a dis- 
torted picture because of the consequent re- 
moval of certain levels of ability through fail- 
ures, discouragement, etc. Since we are con- 
cerned with the relation of a differential se- 
lection of ability to the distribution of marks 
a comparative picture may be obtained by 
treating the intelligence and the academic 
ratings of each individual on the same basis of 
representation. The measure of achievement 
of the same students in their other elections 
was determined by averaging their marks in 
the other courses elected. 
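
The two ways of representing departmental marks described above may be illustrated by a minimal sketch. The records below are hypothetical and are not data from this study; the function names are likewise assumptions introduced only for illustration. The sketch converts letter marks to the 5-point rating (A = 5 to E = 1), averages each student's marks within a department, and contrasts the mean of these individual averages with the mean of all course marks taken without regard to the individuals making them.

```python
from collections import defaultdict
from statistics import mean

# 5-point rating described in the text: A = 5 ... E = 1.
POINTS = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}

# Hypothetical (student, department, letter mark) records, for illustration only.
records = [
    ("s1", "Language", "A"), ("s1", "Language", "B"), ("s1", "Language", "A"),
    ("s2", "Language", "D"),
    ("s3", "Language", "C"), ("s3", "Language", "C"),
    ("s1", "English", "B"), ("s2", "English", "C"), ("s3", "English", "C"),
]


def individual_means(records, department, inside=True):
    """Average each student's marks inside (or outside) the department.

    Only students who elected at least one course in the department count.
    """
    students = {s for s, d, _ in records if d == department}
    marks = defaultdict(list)
    for student, dept, letter in records:
        if student in students and (dept == department) == inside:
            marks[student].append(POINTS[letter])
    return {student: mean(values) for student, values in marks.items()}


def mean_by_individuals(records, department, inside=True):
    """Mean of the per-student averages (each individual counted once)."""
    return mean(individual_means(records, department, inside).values())


def mean_by_courses(records, department):
    """Mean of all course marks in the department, individuals disregarded."""
    return mean(POINTS[letter] for _, d, letter in records if d == department)


by_individuals = mean_by_individuals(records, "Language")
by_courses = mean_by_courses(records, "Language")
extra_marks = mean_by_individuals(records, "Language", inside=False)
```

In these made-up records the student who elects the most Language courses also earns the highest marks, so the mean by courses exceeds the mean by individuals; this is the kind of weighting that representation by individuals is meant to avoid.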

As a test of the soundness of the above com- 
parison a further comparison was effected by 
taking the two extreme departments and mak- 
ing distributions of intelligence by number of 
courses elected in them, and by similarly rep- 
resenting all marks in academic courses. 

We were also interested in the role that dif- 
ferential selection of intelligence in various 
departments and differential departmental 
practices in marking might play in influencing 
the correlation between intelligence test scores 
and academic marks. This problem was ap- 
proached by finding the correlation between 
students’ scores on the American Council Ex- 
amination and their average academic work 
for the entire freshman year or completed 
fraction thereof. Further we evaluated these findings by obtaining
correlations between intelligence and academic work in certain selected
departments, and between the intelligence of these students and their
academic work in their other elections.


RESULTS

The differential selections of intelligence, according to the departmental
election of courses, are given in Table 1. It will be observed that the
mean of intelligence of all freshmen is 128.7 while the median is 122.1.
This indicates a positive skewness in ability, a characteristic which all
departmental distributions (saving that of Language) seem to have in
approximately the same degree. Five
departments, namely: Agriculture, Early Ele- 
mentary Education, Rural Education, Music, 
and Physical Education (courses giving aca- 
demic credit such as Hygiene, Health Educa- 
tion, etc. but not Physical Training or Gym- 
nasium) are significantly lower, with means, 
respectively, of 113.1, 114.1, 116.6, 120.0, and 
121.7. Two departments, Geography and 
English, with means of 128.7 and 128.8, pre- 
sent almost perfect samplings of the entire 
group. Since English is a required subject 
this department should have virtually the en- 
tire freshman class passing through its courses 
and hence should present a distribution almost 
identical with that of the class as a whole. 
Three departments, History, Physics and 
Chemistry, and Language have means of 
138.4, 143.8, and 156.4, respectively. These 
are significantly superior to 128.7, the mean 
of all the freshmen. 
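
The quantities behind these comparisons can be sketched as follows. The formulas are not given in the article and are assumed here to be the conventional ones: the probable error of a mean taken as 0.6745 times the standard deviation divided by the square root of N, and the probable error of a difference taken as the root of the sum of the squared probable errors. That second formula treats the two groups as independent, which is only approximate since each department is part of the freshman class. Pearson's mean-median coefficient is added as one common index of the skewness mentioned above; it is also an assumption and not the authors' stated measure. The figures are those printed in Table 1.

```python
import math


def probable_error_of_mean(sd, n):
    """Conventional probable error of a mean (assumed formula): 0.6745 * SD / sqrt(N)."""
    return 0.6745 * sd / math.sqrt(n)


def critical_ratio(mean_a, pe_a, mean_b, pe_b):
    """Difference of two means divided by the probable error of the difference.

    Assumes independent groups, so PE(diff) = sqrt(PE_a^2 + PE_b^2).
    """
    pe_diff = math.sqrt(pe_a ** 2 + pe_b ** 2)
    return abs(mean_a - mean_b) / pe_diff


def pearson_skewness(mean, median, sd):
    """Pearson's mean-median skewness coefficient; positive when mean > median."""
    return 3 * (mean - median) / sd


# Figures as printed in Table 1.
pe_agriculture = probable_error_of_mean(47.4, 336)   # tabled P. E.: 1.7
pe_all = probable_error_of_mean(52.2, 1550)          # tabled P. E.: 0.9
cr_agriculture = critical_ratio(113.1, 1.7, 128.7, 0.9)  # tabled critical ratio: 8.2
skew_all = pearson_skewness(128.7, 122.1, 52.2)
```

Under these assumptions the recomputed critical ratio for Agriculture comes to about 8.1 against the tabled 8.2, a difference that may be due to rounding of the printed probable errors.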

In Table 2 are presented the characteristics 
of the distribution of the individuals’ average 
academic marks within each department which 
differs appreciably in original ability from the 
level of the group as a whole. Thus 3.19 is the average mark in
Agriculture made by the same individuals listed under Agriculture in
Table 1 with an intelligence average of 113.1; and the same individuals
made an average of 2.97 in their non-Agriculture marks (Table 3).
One department of representative ability, 
Geography, is included for purposes of com- 
parison. The mean of all freshman marks per 
individual is 3.09, and the median under these 
conditions has the same value. Among the 
included departments the following: Mathe- 
matics, History, Language, and Geography 
(with means of 2.82, 2.86, 2.87, and 2.98, re- 
spectively) are significantly lower in scholar- 
ship than the freshman class as a whole. 
However, two of these departments, History
and Language, have selections of intelligence
significantly superior to the freshmen as a 
whole and a third department, Mathematics, 
is superior (critical ratio = 2.4). In the 
populations studied only Geography, as pre- 
viously mentioned, has a selection of average 
intelligence. Physical Education (with a 
scholarship rating of 3.05) and Early Elemen- 
tary Education (with 3.08) are not signifi- 
cantly lower than the freshman scholarship as 
a whole, but their selections of ability are sig- 
nificantly lower. Two departments, Rural 
Education (3.12) and Physics and Chemistry 
(3.14) are not significantly higher, while Ag- 
riculture (3.19) is probably significantly 
higher (critical ratio = 3.33) than the schol- 
arship mean of all freshmen. Of these three 
departments, Rural Education and Agricul- 
ture have selections of intelligence signifi- 
cantly inferior to the level of all freshmen, 
while Physics and Chemistry have a signifi- 
cantly superior selection. 

In Table 3 are listed the means of all other 
course marks of individuals taking work in 
the departments represented in Table 2. 
From these results it will be seen that students of Agriculture in their
other studies do a quality of work significantly inferior to their marks
in Agriculture and also to the general scholarship of freshmen. This
indicates that the mean of their other marks more truly expresses
their intelligence. Similarly the extra-depart- 
mental scholarship of those electing courses 
in Rural Education reflects the inferior gen- 
eral ability of the students involved. In the 
case of the Physical Education group the 
mean of the marks outside the department (or 
extra-marks) is not significantly lower than 
the mean of marks within the department, but 
it is significantly lower than the mean of all 
freshman marks. It is thus apparent that 
marking in this department tends to conform 
to the selection of intelligence, since the mean 
of marks within the department is 3.05 and 
of all marks is 3.09 with the critical ratio 
equal to 2.00. In the Geography group the 
mean of the extra-marks is not appreciably 
different from the mean of all freshman 
marks, which clearly shows that their achieve- 
ment corresponds with their rank in intelli- 
gence. The mean of the other marks is, how- 
ever, significantly higher than the mean of 
departmental marks. Hence the latter run lower than would be anticipated.
Similarly it is noticeable that the mean of the extra-marks of the
students in Mathematics also corresponds with the selection of
intelligence in that department (slightly higher than average), while the
marks within the department are significantly lower than the mean of all
marks. There is no significant difference between rank in marks within and
outside the
department of Physics and Chemistry, and 
neither of these values is significantly superior 
to the general scholarship mean even though 
the students are of superior intelligence. The 
means of the extra-marks in History and in 
Language are significantly superior to their 
respective departmental means and to the 
mean of all marks. It is apparent that the 
extra-marks indicate the level of ability of the 
students whereas the departmental marks are 
completely misleading. 

Another means of showing the discrepancy 
of the standing in marks earned within a de- 
partment, as compared with that in marks 
made outside by the same students, is pro- 
vided in Table 4. In this table are listed the 
departmental marks together with the extra- 
departmental means for a selected number of 
departmental groups of students. There is also given for each department
the percentage of the group exceeding the 95th, 75th, 25th and 5th
percentiles of the intelligence distribution of all the freshmen. From
these percentiles as separation points we computed the
average academic marks for the departments, 
assuming a positive correlation of 1.00 be- 
tween marks and intelligence and giving “A” 
credit, i.e., 5 points, to each individual who 
stood above the 95th percentile of the fresh- 
man class, ““B” credit, i.e., 4 points, to each 
individual who stood between the 75th and 
95th percentiles of the freshman class, and so 
on to complete the Mendenhall distribution 
of marks for each department. It is indeed 
interesting to observe that the hypothetical 
means of the marks obtained are closer to the 
means of the extra-marks than to those of the departmental marks (except
when these values are approximately the same). In connection with this
comparison it should not be forgotten that the institution does not adhere
rigidly to the Mendenhall system and that the
actual mean of all marks is 3.09 rather than 
a hypothetical 3.00. 
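
The computation of the hypothetical means may be sketched as follows, on the assumption (taken from the description above and the legend of Table 4) that every individual receives the mark belonging to the band of the freshman intelligence distribution in which he stands, under a 5-20-50-20-5 division. The function name is hypothetical; the percentages are those printed in Table 4.

```python
def hypothetical_mean(pct_above_95, pct_above_75, pct_above_25, pct_above_5):
    """Mean mark if marks correlated 1.00 with intelligence.

    Arguments are the percentages of a departmental group standing above the
    95th, 75th, 25th and 5th percentiles of the whole freshman class.
    """
    shares = {
        5: pct_above_95,                  # A: above the 95th percentile
        4: pct_above_75 - pct_above_95,   # B: between the 75th and 95th
        3: pct_above_25 - pct_above_75,   # C: between the 25th and 75th
        2: pct_above_5 - pct_above_25,    # D: between the 5th and 25th
        1: 100.0 - pct_above_5,           # E: below the 5th percentile
    }
    return sum(points * share for points, share in shares.items()) / 100.0


# Percentages as printed in Table 4.
agriculture = hypothetical_mean(2.56, 14.83, 65.06, 92.49)
physical_education = hypothetical_mean(3.65, 21.32, 70.34, 93.24)
physics_chemistry = hypothetical_mean(7.92, 32.37, 84.82, 97.11)
all_freshmen = hypothetical_mean(5.00, 25.00, 75.00, 95.00)
```

With the printed percentages this sketch gives values that round to the tabled hypothetical means for Agriculture (2.75), Physical Education (2.89), Physics and Chemistry (3.22), and all freshmen (3.00).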

It may be argued that representation of marks as segregated by individuals
must lead to a certain distortion of the departmental means because of the
minimizing of intra-annum selective changes. To sample the nature of such
changes two departments, Language and Agriculture, which are on opposed
extremes in both intelligence and academic work,³ were selected for
further study. In Table 5 are given the means of all marks (individuals
disregarded) for the freshman year in the departments selected, as well as
the means outside the department. Similarly the selection of intelligence
is found by representing each course mark with the psychological test
score of the individual making the mark.⁴
The mean of Language marks so obtained is 
3.12, which is much higher than the value 2.87 
obtained by averaging marks as segregated by 
individuals. The mean so obtained in Agri- 
culture, 3.18, is not appreciably different from 
that (3.19) obtained in the manner described 
in connection with Table 2. The extra-marks 
mean of the Language group, 3.32, is the same 
by both methods, and that of Agriculture, 
3.00, is only slightly higher than the 2.97 ob- 
tained by the method illustrated in Table 2. 

However, since this study attempts to con- 
sider the relation between marks and intelli- 
gence it is necessary to observe what happens 
to the distribution of intelligence under the 
above conditions. We find that as compared 
with the individual representation of intelli- 
gence test scores in Table 1 which gave means 
of 156.4 and 113.1 to Language and Agricul- 
ture, respectively, we now have 165 and 111, 
respectively. In other words we find that the 
representation of intelligence by courses indi- 
cates an intra-annum selection which improves 
the general ability of the Language group, 
whereas for Agriculture the result goes in the 
opposite direction. 


³ Language having highest intelligence and lowest academic marks and
Agriculture vice versa.
⁴ See footnote 2. Thus in Agriculture 336 individuals took 430 courses.
The value for intelligence in Table 5 is based upon the latter.

The existence of great injustices in marking, even when the same
proportions of marks are awarded, becomes apparent in a study of Table 6,
unless it is true that academic work as represented by marks bears little
relation to intelligence as measured by tests. In this table both
intelligence and marks are computed upon the basis of number of course
marks rather than upon the basis of segregation by individuals.
Examination of the percentage distribution of marks in a 5-category system
shows approximately the same features in both departments investigated.
But viewed in the light of the percentage of individuals above the
percentiles (of the entire freshman group) which set off the mark
divisions of the Mendenhall system of marking the following striking facts
emerge:

1. 16% of the Language group stand above the 95th percentile of the
freshman class in intelligence with only 6% of A’s; 2% of the Agriculture
group are above the 95th percentile of the freshman class but here also
there are 6% of A’s.

2. 36% of the Language group stand between the 95th and 75th percentiles
of the freshman class, with only 28% of B’s; 12% of the Agriculture group
stand between the 95th and 75th percentiles of the freshman class, but
here there are 28% of B’s.


3. 38% of the Language group stand between the 25th and 75th percentiles
of the freshman class, with 43% C’s; 51% of the Agriculture group are
between the 25th and 75th percentiles with 49% of C’s.

4. 9% of the Language group stand between the 5th and 25th percentiles of
the freshman class, but there are 16% of D’s; 27% of the Agriculture group
are between


TABLE 1

DIFFERENTIAL SELECTION OF INTELLIGENCE IN THE DEPARTMENTAL ELECTION OF COURSES BY SIX
SUCCESSIVE FRESHMAN CLASSES OF A TEACHERS COLLEGE

                                                                         Critical
Department                   N            Mean    S. D.  P. E.  Median   ratio (a)
Agriculture                  336          113.1   47.4   1.7    105.2    8.2
E. Elementary Education      112          114.1   46.8   3.9    108.4    4.7
Rural Education              439          116.6   48.0   1.5    108.5    [illegible]
Music                        492          120.0   47.4   1.4    113.3    5.1
Physical Education           688          121.7   61.5   1.3    112.9    4.4
Manual Arts                  165          122.7   53.0   2.8    114.6    [illegible]
Biology                      [illegible]  123.7   48.3   1.4    119.0    2.9
Practice Teaching            251          125.2   46.6   2.0    118.1    1.6
Art                          410          125.8   50.7   1.7    120.1    1.5
Home Economics               89           126.2   45.9   3.3    116.4    0.7
Psychology and Education     392          126.8   51.9   1.8    117.7    [illegible]
[illegible]                  1073         127.6   50.7   1.0    121.1    0.8
Commerce                     114          127.8   51.8   3.3    125.7    0.3
Geography                    679          128.7   52.3   1.4    122.3    0.0
All Freshmen                 1550         128.7   52.2   0.9    122.1
English                      1471         128.8   51.6   0.9    123.1    0.1
Mathematics                  519          133.0   54.0   1.6    123.9    2.4
History                      625          138.4   52.2   1.4    131.8    5.7
Physics and Chemistry        301          143.8   53.1   2.1    136.9    6.7
Language                     263          156.4   53.6   2.2    156.3    11.5

Legend: a, critical ratio equals the difference between the departmental mean and the
mean of all freshman marks divided by the P. E. of the difference of the means.


TABLE 2

AVERAGE ACADEMIC MARKS IN DEPARTMENTS WHOSE SELECTION OF INTELLIGENCE IS
SIGNIFICANTLY DIFFERENT FROM THE FRESHMAN CLASS AS A WHOLE

                                                                  Critical
Department                   N      Mean   S. D.  P. E.  Median   ratio (a)
Mathematics                  519    2.82   0.98   0.03   2.90     9.00
History                      625    2.86   0.92   0.02   2.92     11.50
Language                     263    2.87   0.96   0.04   2.94     5.50
Geography                    679    2.98   0.91   0.02   3.01     5.50
Physical Education           688    3.05   0.85   0.02   3.02     2.00
E. Elementary Education      112    3.08   0.68   0.04   3.04     0.25
All Freshmen                 1550   3.09   0.66   0.01   3.09
Rural Education              439    3.12   0.72   0.02   3.06     1.50
Physics and Chemistry        301    3.14   1.00   0.04   3.11     1.25
Agriculture                  336    3.19   0.84   0.03   3.10     3.33

Legend: a, same as in Table 1.





TABLE 3

AVERAGE OF ALL OTHER ACADEMIC MARKS OF THOSE STUDENTS ELECTING COURSES IN
CERTAIN DEPARTMENTS

                                                                       Critical Ratio
Department                   N            Mean   S. D.  P. E.  Median  A       B
Agriculture                  336          2.97   0.60   0.02   2.95    6.00    5.50
Rural Education              439          2.97   0.60   0.02   2.97    6.00    5.00
Physical Education           688          3.01   0.61   0.02   3.02    4.00    1.33
E. Elementary Education      112          3.03   0.58   0.04   3.02    1.50    0.83
Geography                    679          3.08   0.63   0.02   3.08    0.50    3.33
All Freshmen                 1550         3.09   0.66   0.01   3.09
Mathematics                  519          3.12   0.65   0.02   3.10    1.50    7.50
Physics and Chemistry        300          3.13   0.70   0.03   3.12    1.33    0.20
History                      [illegible]  3.17   0.66   0.02   3.15    4.00    10.33
Language                     263          3.32   0.68   0.03   3.32    7.67    9.00

Legend: A, critical ratio of the difference between the mean of the marks in other
departments and that of all freshman marks to the P. E. of the difference.
B, critical ratio of the difference between the mean of the marks in other
departments and that of the departmental mean to the P. E. of the difference.

TABLE 4
COMPARISON OF MEANS OF ACADEMIC MARKS IN RELATION TO THE RELATIVE ROLE OF THE
DEPARTMENTAL SELECTION OF INTELLIGENCE WITHIN THE ENTIRE FRESHMAN CLASS

                            Lang.        Geog.   Phys. Ed.   All     Physics &   Ag.
                                                                     Chem.
In department               2.87         2.98    3.05        3.09    3.14        3.19
Of other marks              3.32         3.08    3.01                3.13        2.97
Hypothetical mean (a)       3.42         3.00    2.89        3.00    3.22        2.75
Intelligence
% above 95th percent.       [illegible]  4.79    3.65        5.00    7.92        2.56
% above 75th percent.       45.19        24.48   21.32       25.00   32.37       14.83
% above 25th percent.       87.60        75.63   70.34       75.00   84.82       65.06
% above 5th percent.        97.88        95.98   93.24       95.00   97.11       92.49

Legend: a, mean of marks assuming correlation between academic marks and intelligence test scores in
the freshman class as 1.00 and computing on basis of departmental representation within a
5-20-50-20-5 distribution.
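
The hypothetical mean described in the legend of Table 4 can be sketched as follows. This is an illustration only: the numeric values attached to the five letter marks (A = 5 down to E = 1) are an assumption, chosen because they give the printed 3.00 for the whole class; the function name is hypothetical. Under that assumption the calculation returns the printed values for the Physical Education, Physics and Chemistry, and Agriculture columns.

```python
# Assumed numeric scale for the letter marks: A=5, B=4, C=3, D=2, E=1.
# The source does not state the scale; it is inferred from the 3.00 printed
# for "All" under a 5-20-50-20-5 distribution.

def hypothetical_mean(pct_above_95, pct_above_75, pct_above_25, pct_above_5):
    """Mean mark if marks followed intelligence rank exactly (r = 1.00)."""
    a = pct_above_95                      # share of the group earning A
    b = pct_above_75 - pct_above_95       # share earning B
    c = pct_above_25 - pct_above_75       # share earning C
    d = pct_above_5 - pct_above_25        # share earning D
    e = 100.0 - pct_above_5               # share earning E
    return (5 * a + 4 * b + 3 * c + 2 * d + 1 * e) / 100.0

# Percentages copied from the intelligence rows of Table 4.
table_4 = {
    "All": (5.00, 25.00, 75.00, 95.00),
    "Phys. Ed.": (3.65, 21.32, 70.34, 93.24),
    "Physics & Chem.": (7.92, 32.37, 84.82, 97.11),
    "Ag.": (2.56, 14.83, 65.06, 92.49),
}

for dept, pcts in table_4.items():
    print(f"{dept}: {hypothetical_mean(*pcts):.2f}")
```

A department that draws heavily from the upper percentiles of the class therefore receives a hypothetical mean above 3.00, and one that draws from the lower percentiles a mean below it.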
i 
TABLE 5
ACADEMIC MARKS AND INTELLIGENCE ACCORDING TO TOTAL COURSE MARKS RATHER THAN
ACCORDING TO TOTAL NUMBER OF INDIVIDUALS

             In department      Other marks        Intelligence
             Lang.    Ag.       Lang.    Ag.       Lang.    Ag.
[N]          721      430       2147     3228      721      430
[Mean]       3.12     3.18      3.32     3.00      165      111
[P.E.]       .02      .03       .01      .01       1.3      1.5
TABLE 6
DISTRIBUTIONS OF INTELLIGENCE IN TWO DEPARTMENTS ACCORDING TO PERCENTILES OF ENTIRE
FRESHMAN CLASS AS COMPARED WITH DISTRIBUTIONS OF COURSE MARKS

                                      Language               Agriculture
Percentiles                           Intell.   Marks        Intell.   Marks
                                      %         %            %         %
[% above 95th percent.]               16        A  6         2         A  6
% between 75th and 95th percent.      36        B 28         12        B 28
% between 25th and 75th percent.      38        C 48         51        C 49
% between 5th and 25th percent.        9        D 16         27        D 14
% between 0 and 5th percent.           1        E  6          8        E  3













the 5th and 25th percentiles of the freshman 
class and there are only 14% of D’s. 


5. 1% of the Language group are below
the 5th percentile of the freshman class but
there are 6% of E’s; 8% of the Agriculture
group are below the 5th percentile but there
are only 3% of E’s.

The writers believe that the demonstration 
of a differential selection of intelligence in the 
different departments in conjunction with a 
relatively fixed percentage of the various let- 
ter marks reveals one of the factors tending to 
lower the correlation between academic marks 
and intelligence test scores for the usual un- 
selected school or college population. Such 
a population may well include sub-groups like 
those discussed above in which the students 
represent only a restricted portion of the in- 
telligence distribution while their academic 
marks distribute normally over the entire 
range. There might be, for example, a stu- 
dent with rather low intelligence who would 
receive an unduly high mark because he 
elected a course where he had to compete only 
with students of still lower intelligence. Con- 
versely, a student of better than average na- 
tive ability might receive an unduly low 
grade in a course where the competition was 
with students still higher than himself in intelligence.
Within each sub-group the correlation between intelligence
and marks might be high, since within the group it is only a
matter of relatively low or relatively high intelligence.
But if all the sub-groups were thrown together, instances
like the above of low intelligence with high grades, and vice
versa, would obviously lower the correlation.
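
The pooling effect described above can be illustrated with a small constructed example. The scores and marks below are invented for the illustration and are not taken from the study; the assumption is simply that each sub-group covers a different part of the intelligence range while its marks are spread over the full five-step scale.

```python
import math

def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical intelligence scores for a low-ability and a high-ability group.
low_scores = [80, 85, 90, 95, 100]
high_scores = [100, 105, 110, 115, 120]
# Each group receives the full range of marks, ranked within the group only.
marks = [1, 2, 3, 4, 5]

r_low = pearson_r(low_scores, marks)
r_high = pearson_r(high_scores, marks)
r_pooled = pearson_r(low_scores + high_scores, marks + marks)

print(f"within low group:  {r_low:.3f}")
print(f"within high group: {r_high:.3f}")
print(f"groups pooled:     {r_pooled:.3f}")
```

Within each constructed group the marks follow intelligence perfectly, yet the pooled coefficient is markedly lower, because a top mark in the low group and a bottom mark in the high group are attached to nearly the same intelligence score.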


The actual demonstration of the operation 
of these factors in lowering the correlation be- 
tween college marks and intelligence does not 
appear to be easily obtainable from our data. 
In the departments selected for intensive analysis, Language
and Agriculture, we find correlations of .52 ± .03 and
.259 ± .034, respectively, as compared with a correlation of
.55 ± .01 between all freshman marks and intelligence test
scores. If marks could be considered as distributed
approximately the same in all departments, correction for
attenuation would bring the coefficients of the two
departments studied to approximately the same value. If the
marks within both Language and Agriculture are combined in a
correlation with intelligence test scores the correlation
obtained is .256 ± .025, whereas if the extra-marks are so
combined and so correlated the correlation is .607 ± .017.
Since the academic marks are not distributed in exactly the
same proportions in all departments, this particular mode of
analysis is inconclusive.
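
The probable errors attached to the coefficients above are consistent with the conventional formula P.E. = 0.6745 (1 - r²) / √N. The sketch below is offered on that assumption; the text does not state which formula was used, and the N values are read from Tables 2 and 3 and may not be the exact counts behind each coefficient.

```python
import math

def probable_error_of_r(r, n):
    """Conventional probable error of a product-moment r (assumed formula)."""
    return 0.6745 * (1.0 - r * r) / math.sqrt(n)

# (r, N) pairs. The N for each group is an assumption taken from the tables.
cases = {
    "Language": (0.52, 263),
    "Agriculture": (0.259, 336),
    "All freshmen": (0.55, 1550),
}

for name, (r, n) in cases.items():
    print(f"{name}: r = {r} +/- {probable_error_of_r(r, n):.3f}")
```

With these inputs the formula gives values that round to the probable errors quoted in the text for the three coefficients.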


DISCUSSION 


Factors determining the differential selec- 
tion of intelligence by departments. One of 
the major factors determining whether courses 
shall be elected and followed by sequential 
courses is the nature of the departmental per- 
sonnel. A reputation for rigorous marking 
and for heavy assignments may operate to 
frighten students of lesser ability away and to 
challenge some of the brighter students, or it 
may discourage all types. A reputation for 
easy marking and for ‘“‘snap” courses may at- 
tract students of comparatively low intellec- 
tual ability. Highly subjective marking, us- 
ually rationalized as an evaluation of person- 
ality characteristics, frequently tends to turn 
brighter students away. Rapid shifts in per- 
sonnel, particularly if departmental policies 
are involved, would tend to nullify faculty 
personnel as a selective factor. During the 
time span of this study, however, there has 
been remarkably little change in teaching 
personnel at the institution in question. 


Economic and social forces are extremely 
potent determinants of selection. Thus the 
level of intelligence of students electing 
courses in Agriculture has steadily declined 
during the six-year period involved in this 
study. The economic decline of agriculture 
also reflects itself in the changing character 
of those electing courses in Rural Education 
and in Early Elementary Education. On the 
other hand the pressure to make teacher-train- 
ing institutions also take on the functions of 
the junior college or regional arts college leads 
towards the appearance of a pre-professional 
group of superior ability, the members of 
which elect courses in the sciences, languages, 
etc., and consequently help to raise the level 
of intelligence in the departments which offer 
such courses. The differential in remunera- 
tion together with selective factors implicit 
in a longer period of preparation and a ten- 
dency toward an intellectual interest in sub- 
ject matter all operate to make the group pre- 
paring to teach in the high school superior to 
other curricular groups. These prospective 
high school teachers tend to elect courses in 
the same departments that are favored by the 
pre-professional group. 










Although all students preparing to teach 
are required eventually to take a specified 
number of courses in Psychology and Educa- 
tion only those individuals who are working 
for limited certificates or who already have 
some county normal courses are allowed to 
elect courses in this department during the 
freshman year. Since these curricular groups 
tend to be somewhat inferior, the position of 
this department in its freshman elections can- 
not be said to be representative. 


Factors determining the distribution of aca- 
demic marks. The existence of certain mark- 
ing standards or specifications in a college 
tends on the whole to prevent a gross distor- 
tion in the relative proportions of the various 
marks awarded by the several departments. 
It does not, when applied to all course elec- 
tions within a department, prevent the reser- 
vation of a greater part of the higher marks 
for upper classmen or majors within the de- 
partment. It does not, of course, prevent 
subjective factors from influencing the marks 
awarded and consequently lowering the corre- 
lation between such indices of achievement 
and intelligence. Thus we have noted that 
within the department of Agriculture there is 
a correlation of .259 ± .034 between these
values (although the major factor responsible
for the low correlation may be the restriction
of range in intelligence). A similar state of
affairs would be found in the Rural Education
group. On the other hand, the correlation
in the Language group is .52 ± .03.
But the fact that the “cream” of the fresh- 
men appears to be concentrated in this group 
means that the awarding of marks on the basis 
of fixed proportions for all classes will auto- 
matically cause some of these students to re- 
ceive lower letter marks than if they had en- 
rolled in other departments. The sampling 
of ability is not representative and hence a 
grading system based upon the theory of rep- 
resentative sampling is not justifiable. 


The future of marking systems. It seems 
evident that no fixed and arbitrary system of 
marking which ignores the differential selec- 
tion of ability, whether temporary or rela- 
tively constant, can survive close scrutiny and 
investigation. Further, such a differential if 
operating on a national scale will inevitably 
prevent the use of standardized tests and na- 
tional percentiles which are intended to obtain 
an index of individual achievement which will 
make a valid comparison possible. 




The proposal to abolish marks entirely or 
to avoid any measurement of the individual 
on a comparative basis may be dismissed as 
unrealistic. To eliminate competition in the 
educational system on the grounds that such 
extrinsic motivation is harmful is to set up an 
educational system at variance with the so- 
ciety in which the educational system exists. 
Moreover it is perfectly apparent that even 
cooperative or communistic societies do not 
function without extrinsic forms of motivation 
nor without regard to the differential ability 
and achievement of individuals. It may be 
that a solution to the problem of marks on 
the college level will be found in job analysis 
and the setting up of job specifications but 
there is no hiding from the fact that the place 
of an individual in a society will be deter- 
mined on a comparative basis. In a society 
termed “democratic” the disregard of differ- 
ential achievement in favor of personality 
values or judgments simply opens the door 
wider for favoritism and for subjective dis- 
crimination of all kinds on the part of those in 
power. 


SUMMARY AND CONCLUSIONS 


1. In the college at which this investigation 
was conducted, the election of courses by de- 
partments is not random but selective. Hence 
some departments do not show in their stu- 
dents a representative sampling of the abilities 
of the freshman class. 


2. If it is true that academic work or 
achievement is significantly related to intelli- 
gence, as the latter is measured by tests, no 
arbitrary or fixed marking system can give 
equitable marks or marks of comparable value. 


3. If differential selection of intelligence 
takes place in an essentially uniform manner 
throughout the country, comparable levels of 
achievement can not be ascertained by aver- 
aging ranks obtained from standardized tests. 


4. A differential selection of intelligence in 
conjunction with a marking system based up- 
on the normal probability curve necessarily 
operates to reduce the degree of correlation 
between the two sets of variables, intelligence 
test scores and academic marks. 


REFERENCES 


1. Bohan, J. E. Students’ Marks in College
Courses. Minneapolis: University of Minnesota
Press, 1931. Pp. xiii + 133.

2. Davis, J. D. “The effect of the 6-22-44-22-6
normal curve system on failures and grade
values.” J. educ. Psychol., 1931, 22, 636-640.

3. Eells, W. C. “An improvement in the
theoretical basis of the five point grading
systems based on the normal probability
curve.” J. educ. Psychol., 1930, 21, 128-135.

4. “‘The effect of the 6-22-44-22-6 normal
curve system on failures and grade values’ -
A comment.” J. educ. Psychol., 1932, 23, 466-468.

5. Finkelstein, I. F. “The marking system
in theory and practice.” Educational
Psychology Monographs, 1913, No. 10.

6. Meyer, M. F. “Oberlin grades its students
in the registrar’s office.” Peabody J. Educ.,
1932-33, 10, 53-56.



















THE EFFECT OF METHOD OF PRESENTATION ON 
SPELLING SCORES 


Dewey B. Stuit 
University of Nebraska 


and 


CLIFFORD E. JURGENSEN
Graduate Student, University of Iowa


While scoring the spelling sub-test of the Cooperative
English Test, Series II (1935 form), which had been
administered to the majority of our freshmen,¹ it occurred to
the writers that the method of presentation employed in this
particular test might not be very effective in securing a
measure of a student’s real spelling ability. For the benefit
of those unfamiliar with the Cooperative English Test,
Series II, it should be pointed out that a student’s ability
to spell is measured by his ability to identify misspelled
words in eight themes which make up the test of English
Usage. Along with the misspelled words are errors in grammar,
punctuation, and capitalization, all of which are to be
corrected by the student in space provided at the right of
the page. It seemed to the writers that the spelling test was
really a measure of proof-reading ability, and for that
reason they determined to measure spelling ability by another
procedure.


The form of spelling test employed to serve 
as a check on the Cooperative English Test, 
Series II, was the dictation test. The new 
test consisted of the fifty-three misspelled 
words which are found in the eight themes of 
the Cooperative test, and it was administered 
by the four English instructors who had all 
the freshmen in their classes. The scores 
made on this test were then compared with 
those made on the Cooperative English Test. 
The results are presented in Table I. 


TABLE I
SUMMARY OF AVERAGES, STANDARD DEVIATIONS, AND CORRELATIONS

               Dictation Method    Cooperative Test                 Sigma of the   Values of
         N     Mean     Sigma      Mean     Sigma      Difference   Difference     r₁₂
Men      82    47.50    5.20       27.20    13.15      20.30        1.10           .74
Women    108   49.83    3.64       36.54    11.26      13.29         .91           .62
Total    190   48.83    4.58       32.51    12.96      16.32         .74           .71


The evidence in Table I apparently points 
to the conclusion that if spelling ability per se 
is to be measured, the dictation method is pro- 
ductive of distinctly higher scores than those 
obtained from the Cooperative English Test. 
It is rather striking that women did much bet- 
ter in the Cooperative Test than did the men. 
In the case of both sexes, however, the su- 
periority of the students in the dictation test 
is brought out. 
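
The "Sigma of the Difference" column of Table I appears to follow the usual formula for the standard error of a difference between two correlated means, since both tests were taken by the same students. The sketch below assumes that formula (it is not stated in the text); the function name is hypothetical and the inputs are copied from Table I.

```python
import math

def sigma_of_difference(sigma_1, sigma_2, r_12, n):
    """Standard error of the difference between two correlated means."""
    se_1 = sigma_1 / math.sqrt(n)   # standard error of the first mean
    se_2 = sigma_2 / math.sqrt(n)   # standard error of the second mean
    return math.sqrt(se_1 ** 2 + se_2 ** 2 - 2.0 * r_12 * se_1 * se_2)

# (dictation sigma, Cooperative sigma, r12, N) from Table I.
table_1 = {
    "Men": (5.20, 13.15, 0.74, 82),
    "Women": (3.64, 11.26, 0.62, 108),
    "Total": (4.58, 12.96, 0.71, 190),
}

for group, (s1, s2, r, n) in table_1.items():
    print(f"{group}: {sigma_of_difference(s1, s2, r, n):.2f}")
```

Under this assumption the computed values agree with the printed column to two decimals, and each difference in means is many times its standard error.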

In addition to determining the actual differences in the
magnitudes of the scores, the writers were interested in
finding out how well the scores in the dictation test
correlated with the scores in the Cooperative Test. This is
obviously a matter of concern where percentile scores are
calculated, because in such an instance a person’s relative
standing rather than his actual score is of most importance.
For the group involved in this investigation the values of
r₁₂ (designating scores on the dictation test as X₁ and
scores on the Cooperative Test as X₂) are given in the last
column of Table I. While these coefficients are somewhat
higher than one might expect, they nevertheless are outside
the range of high test reliabilities. The factor or factors
measured in the two tests are similar but undoubtedly are not
identical. Perhaps speed and attention to detail play a more
prominent role in the Cooperative than in the dictation test.

¹ This study was made while the writers were members of the
Department of Psychology and Education at Carleton College.

Doubtless it will have occurred to the reader that while all
the students finished the dictation test, not all finished
the Cooperative English Test. This would naturally have
considerable effect upon the relative magnitudes of the
scores. In order to obtain a measure of the number of words
actually overlooked by each student, the writers checked each
paper to see how far each student was able to go in the
Cooperative Test during the time allotted. This number was
then compared with the score actually made by the student.
Since extremely few words were misspelled in the corrections
which were actually made, this method gave an approximate
measure of the number of misspelled words which were
overlooked or possibly judged correct by the students. The
results are presented in Table II.


TABLE II
COMPARISON OF WORDS ATTEMPTED WITH SCORES MADE IN THE TWO TESTS

                 Cooperative English Test                           Dictation Test
                 Average     Average   Average      Average not     Average   Average number
          N      Attempted   Score     Overlooked   Attempted       Score     Misspelled
Men       82     41.90       27.20     14.70        11.10           47.50     5.50
Women     108    47.49       36.54     10.95         5.51           49.83     3.17
Total     190    45.08       32.51     12.57         7.92           48.83     4.17


In Table II the general trend of differences 
found in Table I is repeated. Judging from 
the results presented in the third and fourth 
columns of Table II, men overlook more words 
and work less rapidly than do the women. 
In the last column of Table II it can be seen 
that the average number of words missed per 
student was 4.17 (total list of 53 minus 48.83) 
in the dictation test, and in the third column 
it is shown that the average number of words 
missed, or rather, overlooked, in the Coopera- 
tive Test was 12.57. Hence, once more the 
superiority of the students in the dictation 
test is pointed out and the trend of sex dif- 
ferences noted in Table I repeated. 
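
The derived columns of Table II follow from simple subtraction on the 53-word list, as the parenthetical remark above indicates. A minimal sketch, with hypothetical names and the averages copied from Table II:

```python
TOTAL_WORDS = 53  # misspelled words in the eight themes, per the text

def derived_columns(attempted, cooperative_score, dictation_score):
    """Recompute the derived averages of Table II from the measured ones."""
    return {
        "overlooked": round(attempted - cooperative_score, 2),
        "not_attempted": round(TOTAL_WORDS - attempted, 2),
        "misspelled": round(TOTAL_WORDS - dictation_score, 2),
    }

# (average attempted, Cooperative score, dictation score) from Table II.
table_2 = {
    "Men": (41.90, 27.20, 47.50),
    "Women": (47.49, 36.54, 49.83),
    "Total": (45.08, 32.51, 48.83),
}

for group, values in table_2.items():
    print(group, derived_columns(*values))
```

Read this way, the words overlooked on the Cooperative Test are counted only among those the student reached, so the unreached words and the overlooked words together account for the gap between the two tests.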


The results obtained by the writers are in approximate
agreement with those reported by Foran² in a study of
spelling ability in grades 6, 7, and 8. In this investigation
Foran used six different types of spelling tests. Of these,
his fifth test, in which the students were required to detect
misspellings (and correct them) in a short story, corresponds
to the type of test found in the Cooperative English Test,
Series II. Foran found³ that his fifth test gave the lowest
mean score and the largest standard deviation of the six
tests employed in his investigation. The mean correlation⁴ of
his fifth test with his first test, which was the one in
which the dictation method was used, was .766. It will be
noted that Foran’s results agree very closely with those
reported by the present writers.

² Foran, T. G., “The Form of Spelling Tests”, Catholic
University of America Educational Research Bulletin, vol. 4,
no. 8. Washington, D. C.: Catholic University of America,
1929, pp. 16-22.
³ [illegible]
⁴ [illegible]


Summarizing the evidence, the following 
conclusions appear to be tenable for the group 
involved in this study. 


1. While it cannot be stated definitely that 
the dictation type of spelling test is more 
valid and reliable than the spelling test 
in the Cooperative English Test, it is 
clear that in this investigation students 
tended to score higher on the dictation 
test than they did on the spelling test in 
the Cooperative English Test, Series II.

2. Women appear to make better scores 
than men on both spelling tests but show 
their superiority chiefly in the Coopera- 
tive Test. Could it be said that women 
pay more attention to details and work 
more rapidly than do the men? 

3. The correlation between the scores made 
on the two spelling tests is fairly high. 
This is fortunate because it means that 
percentile scores will perhaps not be too 
greatly affected by the fact that the Co- 
operative English Test is not likely to 
give one a true measure of a student’s 
actual ability to spell a word if called up- 
on to do so. 

4. The results of this and other investigations
raise some questions as to the value of teaching
spelling in an incidental manner. Can we expect
to train good spellers if we do not teach
spelling as such? Will a student be able to
recall the correct spelling of a word if he has
been given training only in recognizing it?


5. All factors considered, it seems that a test
requiring a student to write dictated sentences
would be more valid as a test of spelling
ability. This is essentially the type of test
described by Foran⁵ as the modified sentence
type. Such a test appears to possess merit
because in practical situations the student is
called upon more often to recall the correct
spelling of a word than he is to recognize its
incorrect spelling in material which is read.
Undoubtedly, the student should be prepared to
proof-read his own writing, but it seems to the
present writers that first he should be
instructed in the art of spelling correctly.

⁵ Foran, ibid., p. 24.











DIFFERENCES IN THE ACHIEVEMENT IN GEOGRAPHY, CIVICS 
AND HISTORY, AND GENERAL SCIENCE OF TEACHERS 
COLLEGE ENTRANTS FROM DIFFERENT SECTIONS 
OF THE COUNTRY AND FROM RURAL AND 
URBAN POPULATIONS 


Nora A. CONGDON 


Assistant Director of Personnel 
Colorado State College of Education 


An examination of the results of the Teachers College
Personnel Association testing program¹ showed that the
freshman entrants in the East apparently tested relatively
lower in Geography and General Science than they did in
Civics and History.² The purpose of this study is to
determine whether there are reliable differences between the
test scores made by teachers college entrants from different
sections of the country and to discover some of the causes of
these differences.


1. DIFFERENCES BETWEEN THE EAST 
AND THE WEST 


The twenty-six colleges returning results on
both the English and Elementary Tests² in
the fall of 1935 were divided into three groups
according to the section of the country in 
which they are located. The eastern group 
consists of nine colleges which are located in 
Massachusetts, New Jersey and Pennsylvania; 
the middle group, six colleges in Illinois, Mis- 
sissippi and Wisconsin; and the western group, 
eleven colleges in Arizona, Colorado, Idaho, 
Minnesota, Missouri, New Mexico, North Da- 
kota and South Dakota. 

The means, standard deviations, and standard errors of the
means were computed for the scores made by the freshmen
entering the teachers colleges of the eastern group and those
of the western group, on each of the three tests, Geography,
Civics and History, and General Science. In general
intelligence, as measured by the scores on the combined
English and Elementary Tests, the entrants of the eastern
colleges tested .103 sigmas higher than those of the total
group of twenty-six colleges, while the entrants of the
western colleges tested .131 sigmas below this national mean.
In order to compare the means made by these two groups on the
Geography, Civics and History, and Science tests, it was
first necessary to correct them for the difference in the
intelligence of the groups. This was done by subtracting .103
standard deviations from the mean made on each of the three
tests by the eastern group, and adding .131 standard
deviations to the mean made on each test by the western
group. These corrected means, the sigmas and the differences
between the means made by the East and the West are given in
Table I. As the proportion of men and women students is
approximately the same for both groups there are no spurious
differences due to sex.

¹ [illegible], J. D. The 1936 Report on the Cooperative
Testing Program of the Teachers College Personnel
Association. Colorado State College of Education, Greeley,
Colorado, 1936.
² Entrance and Classification Examination - English and
Elementary Tests, Form A. Colorado State College of
Education, Greeley, Colorado, 1935.
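
The correction for intelligence described above amounts to shifting each group mean by a fixed fraction of a standard deviation. The sketch below is illustrative only: the raw mean and standard deviation are invented, the function name is hypothetical, and it is assumed that the standard deviation used is that of the group's own scores on the test in question, which the text does not specify.

```python
def correct_for_intelligence(raw_mean, sd, sigma_offset):
    """Shift a group mean by its intelligence offset, expressed in SD units.

    sigma_offset is positive for a group testing above the national mean in
    intelligence (its achievement mean is lowered) and negative for a group
    testing below it (its achievement mean is raised).
    """
    return raw_mean - sigma_offset * sd

EAST_OFFSET = 0.103    # eastern entrants: .103 sigmas above the national mean
WEST_OFFSET = -0.131   # western entrants: .131 sigmas below the national mean

# Hypothetical uncorrected mean and SD, the same for both groups.
raw_mean, sd = 40.0, 16.0
east_corrected = correct_for_intelligence(raw_mean, sd, EAST_OFFSET)
west_corrected = correct_for_intelligence(raw_mean, sd, WEST_OFFSET)

print(f"East: {east_corrected:.2f}  West: {west_corrected:.2f}")
```

Two groups with identical uncorrected means would thus differ after correction, the western group standing higher, which is the direction of adjustment the text describes.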


In both Geography and General Science, it 
may be noted from the data in Table I that 
the college freshmen of the West test reliably 
higher than those of the East, while in Civics 
and History those of the East surpass those 
of the West by a statistically reliable differ- 
ence. The largest of the differences occurs 
between the means made by the two groups 
on the General Science Test. On this test 
those students entering the western colleges
test 6.92 raw score points or .4 of a PE score 
higher than those entering the eastern colleges. 
This difference is approximately three times 
the amount by which the eastern group sur- 
passes the western in Civics and History. 
The standard deviations made by the two 
groups on each of the tests are of practically 
the same size. 
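
The standard errors of the differences reported in Table I can be approximated from the standard errors of the two means, treating the eastern and western groups as independent samples. The sketch below makes that assumption, which the text does not state, and uses the rounded SEm values printed for the Geography and General Science rows; the function names are hypothetical.

```python
import math

def se_of_difference(se_a, se_b):
    """Standard error of the difference between two independent means."""
    return math.sqrt(se_a ** 2 + se_b ** 2)

def critical_ratio(difference, se_diff):
    """Difference between means expressed in units of its standard error."""
    return difference / se_diff

# (East mean, West mean, East SEm, West SEm) from Table I.
rows = {
    "Geography": (38.17, 41.45, 0.385, 0.353),
    "General Science": (59.62, 66.54, 0.536, 0.480),
}

for test, (east, west, se_east, se_west) in rows.items():
    se_diff = se_of_difference(se_east, se_west)
    ratio = critical_ratio(east - west, se_diff)
    print(f"{test}: difference {east - west:.2f}, SE {se_diff:.3f}, ratio {ratio:.1f}")
```

On these figures both differences are several times their standard errors, which agrees with the statement that the West tests reliably higher on these two tests.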


TABLE I

THE MEANS, WITH THEIR STANDARD ERRORS, AND THE STANDARD DEVIATIONS OF THE
DISTRIBUTIONS OF SCORES MADE ON THE GEOGRAPHY, THE CIVICS AND HISTORY,
AND THE GENERAL SCIENCE TESTS, BY THE FRESHMEN ENTERING NINE
TEACHERS COLLEGES IN THE EAST AND ELEVEN IN THE WEST IN
THE FALL OF 1935

                                            Mean (corrected for intelligence)
Test                  Number of    Group    PE Score*     Raw Score     SD        SEm
                      students
Geography             1807         East     7.4           38.17         16.38     .385
                      2275         West     [illegible]   41.45         16.84     .353
  Difference and SE difference                            -3.28 ± .522
Civics and History    1807         East     7.5           67.90         23.45     .552
                      2275         West     7.3           65.54         23.36     .490
  Difference and SE difference                             2.36 ± .720
General Science       1807         East     7.3           59.62         22.77     .536
                      2275         West     7.7           66.54         22.871    .480
  Difference and SE difference                            -6.92 ± .719

*The PE score was read from a PE table in which the zero value was located at minus 5 sigmas or
minus 7.4 PE values from the median.


2. RELATION OF THE SIZE OF HIGH SCHOOL
TO ACHIEVEMENT


In order to show the relation between the size of the
population unit from which a student came and his knowledge
in the three subjects tested, the students were placed,
according to the size of their high school graduating
classes, into eight groups, and the means and standard
deviations computed for each of them. These tabulations are
given in Table II and shown graphically in Chart I.

In Geography the size of the mean increases as the size of
the graduating class decreases, that is, those students who
graduate from small high schools tend to make the higher
scores on the Geography Test. The largest difference among
these eight means, that between the group having less than 50
in the graduating class and that having over 500, is 6.72 raw
score points or .6 PE scores. Of the 28 differences eight are
reliable.

[CHART I. Scores made on the Geography, Civics and History,
and General Science tests, plotted on a PE-score axis against
size of high school graduating class.]


TABLE II

THE MEANS AND STANDARD DEVIATIONS OF THE DISTRIBUTIONS OF SCORES MADE ON THE
GEOGRAPHY, THE CIVICS AND HISTORY AND THE GENERAL SCIENCE TESTS BY COLLEGE
FRESHMEN GROUPED ACCORDING TO THE SIZE OF THEIR HIGH SCHOOL
GRADUATING CLASSES

Size of   No. of      Geography                      Civics & History               General Science
H.S.      Students    Mean PE       Mean Raw  SD     Mean PE   Mean Raw   SD        Mean PE   Mean Raw   SD
Class                 Score         Score            Score     Score                Score     Score
1-        1863        [illegible]   41.38     16.60  7.8       65.46      22.90     7.7       67.13      22.48
50-       883         7.6           39.99     16.55  7.4       66.26      22.55     7.6       62.31      22.10
100-      486         7.4           38.34     16.95  7.8       65.37      24.40     7.3       60.28      23.12
150-      291         7.3           37.67     15.25  7.3       65.29      22.00     7.5       57.72      23.056
200-      307         7.3           35.92     15.52  7.2       64.36      22.35     7.2       58.62      20.82
250-      250         7.4           38.39     15.65  7.5       68.64      22.35     7.3       59.37      21.76
350-      186         7.4           37.89     16.55  7.5       67.94      23.70     7.2       59.11      22.00
500+      211         7.1           34.56     16.15  7.3       65.45      23.20     7.0       55.45      22.22


The means made by the eight groups of students on the Civics
and History Test vary irregularly with the size of the
graduating class. Although, unlike either Geography or
Science, the scores made by the students graduating from the
larger high schools are apparently a little better than those
made by students from small high schools, none of the
differences are statistically reliable.

Again, it is probable that in the smaller high schools either
the students receive more effective instruction in science or
they have a better opportunity to acquire scientific
information through first hand contacts with a more favorable
environment, as the size of the achievement test scores in
Science increases inversely to the size of the high school
graduating classes. The difference between the extreme groups
is 11.68 or .7 PE. There are eleven reliable differences out
of the 28. All of those involving the smallest group (1-49)
are reliable.

In Chart I it may be seen that these trends 
are not consistent throughout the distribution. 
In general, on all three tests the means de- 
crease as the size of the graduating classes in- 
creases, up to 200. At that point there is a 
decided increase, followed by a decrease. It 
may also be noted that the mean PE scores 
made by the 200-group are the same on all 
three of the tests. Another fact shown by 
the chart is that those graduating with classes 
of from 150 to 250 and 500 or more make, on 
the average, scores below the national median 
on all three tests. 


TABLE III 


[Vol. 5, No. 3 


Another method of showing the relationship 
between the size of the high school graduating 
class and achievement in these three subjects 
is by the coefficient of correlation. Two 
methods, the rank-difference and the product- 
moment, were used in computing the correla- 
tion between the sizes of the high schools as 
represented by the eight groups and the mean 
scores made by these groups on each of the 
three tests. 


The coefficients are as follows: 


Rank- Product- 
Difference Moment 


Test rho r SD, 
Geography ------- —.69 —.89 .073 
Civics and History- 27 25 119 
General Science... —.79 —.88 .079 


The coefficients on the Geography and 
Science tests are high, showing that a marked 
influence is exerted by the size of the popula- 
tion unit. As there are more large high 
schools in the East than in the West there may 
also be sectional differences. 


3. COMPARISON OF THE RELATIVE INFLUENCE 
OF THE LOCALITY AND THE SIZE OF 
THE POPULATION UNIT 


In order to determine, if possible, which 
factor—locality or size of the population unit 
—is more closely connected with variations in 
the scores made on the Geography, Civics and 
History, and Science tests, the scores of the 





THE MEANS, CORRECTED FOR INTELLIGENCE, MADE BY THE FRESHMEN ENTERING EASTERN COLLEGES
AND THOSE ENTERING WESTERN COLLEGES WHO GRADUATED FROM LARGE HIGH SCHOOLS
(GRADUATING CLASSES OF 250 OR MORE) AND THOSE WHO GRADUATED FROM SMALL
HIGH SCHOOLS (GRADUATING CLASSES OF LESS THAN 50)


                        Large                 Small                 Difference
                        High Schools          High Schools          (Large minus Small)
                        East      West        East      West        East               West

No. of Students         315       190         245       1186
Geography               33.93     39.01       38.33     42.00       −4.40 ± 1.449*     −2.99 ± 1.314
                                                                    (−.4 PE)           (−.3 PE)
  Difference            −5.08 ± 1.526*        −3.67 ± 1.216*
  (East minus West)     (−.4 PE)              (−.3 PE)
Civics and History      65.90     64.61       65.56     65.65       .84 ± 1.990        −1.04 ± 1.765
                                                                    (.1 PE)            (0 PE)
  Difference            1.29 ± 2.091          −.09 ± 1.643
  (East minus West)     (.1 PE)               (0 PE)
Science                 51.48     64.06       64.58     67.32       −13.10 ± 1.868*    −3.26 ± 1.755
                                                                    (−.9 PE)           (−.2 PE)
  Difference            −12.58 ± 2.055*       −2.74 ± 1.531
  (East minus West)     (−.9 PE)              (−.2 PE)


* indicates a reliable difference.










entrants of eastern colleges who had graduated
with high school classes numbering less than
50 were put into one group, those who had
graduated with classes of 250 or more in a
second group. The same procedure was applied
to the entrants of the western colleges.
The means, corrected for differences in
intelligence, and the differences, with their
standard errors, between the means are given in
Table III. From these data the following
may be noted:

1. In the West there are no reliable differ- 
ences on any of the tests between the scores 
made by students graduating from large and 
those graduating from small high schools. 
The largest difference is that between the 
means made by the two groups on the Geog- 
raphy Test. 

2. The differences between the means made 
on the three tests by the entrants of the east- 
ern colleges and those of the western colleges 
are larger among the urban groups (high 
school graduating classes of over 250) than 
among the rural groups (classes of less than 
50). 

3. In Geography the largest difference, that 
between the urban groups of the East and of 
the West, is —5.08 (.4 PE scores) which is 
approximately one-third of the standard devia- 
tion. Three out of the four differences be- 
tween means made on the Geography Test are 
statistically reliable. 

4. None of the differences between the 
means made on the Civics and History Test 
are statistically reliable and, furthermore, the 
largest of them is only .1 of a PE score or 
approximately 1/17 of the standard deviation. 

5. The largest differences occur between 
means made on the Science Test. The differ- 
ences between the East and the West are ap- 
proximately 4.5 times as large in the urban 
districts as in the rural districts. The same 
relationship holds true for the differences be- 
tween the means made by graduates of large 
high schools and those of small high schools. 
The difference between these two groups in
the East is 4.5 times as large as the difference
between similar groups in the West.
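
The notes above rest on comparing each difference with its standard error. A minimal sketch of that comparison follows, using the differences and standard errors from Table III. The criterion of a ratio of at least 3 is an assumption on our part: it is not stated in this section, but it reproduces the asterisks in the table. The names used are illustrative only.

```python
# (difference, standard error) pairs transcribed from Table III.
differences = {
    ("Geography", "East, large minus small"): (-4.40, 1.449),
    ("Geography", "West, large minus small"): (-2.99, 1.314),
    ("Geography", "Large schools, East minus West"): (-5.08, 1.526),
    ("Geography", "Small schools, East minus West"): (-3.67, 1.216),
    ("Science", "East, large minus small"): (-13.10, 1.868),
    ("Science", "West, large minus small"): (-3.26, 1.755),
    ("Science", "Large schools, East minus West"): (-12.58, 2.055),
    ("Science", "Small schools, East minus West"): (-2.74, 1.531),
}

ASSUMED_THRESHOLD = 3.0  # assumption: "reliable" means |difference| >= 3 standard errors

def critical_ratio(difference, standard_error):
    """Absolute difference expressed in units of its standard error."""
    return abs(difference) / standard_error

reliable = {
    key for key, (diff, se) in differences.items()
    if critical_ratio(diff, se) >= ASSUMED_THRESHOLD
}

# Ratio of the East-West Science differences, urban to rural (note 5).
urban_to_rural = (-12.58) / (-2.74)

for key, (diff, se) in differences.items():
    flag = "*" if key in reliable else " "
    print(f"{key[0]:10s} {key[1]:32s} {critical_ratio(diff, se):5.2f} {flag}")
```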




4. CONCLUSION 


The data presented in Table III show that 
the test scores made by college entrants in 
Geography and General Science are affected 
both by the locality of the institution which 
they are entering and by the size of the pop- 
ulation unit from which they came. In Geog- 
raphy apparently the locality has a slightly 
greater influence than the size of the popula- 
tion unit as the differences between the East 
and the West, —5.08 and —3.67, are larger, 
though not by a statistically reliable amount, 
than the differences between the urban and the 
rural districts, —4.40 and —2.99. 


In Science the differences between large 
and small high schools, —13.10 and —3.26, 
are larger, though not reliably, than those be- 
tween the East and the West, —12.58 and 
—2.74. This indicates that, in this subject, 
the size of the population unit may be the 
more influential factor. 


The differences between the means made by 
the extreme groups in Civics and History, as 
given in this table, are small and not reliable 
and, therefore, the indication is that they are 
influenced neither by locality, East or West, 
nor by the size of the population unit, rural or 
urban. 


The facts presented in this study are of 
great importance to curriculum experts as well 
as to geography and science instructors. It 
is likely that students in the small western 
high schools have experiences, due to their 
environment, which favor the development of 
concepts in geography and science. Their 
outdoor life, the sectional occupations such as 
agriculture and mining, and the dependent in- 
dustries may afford an opportunity for some 
of the contacts which contribute definitely to 
the geographical and scientific knowledge of 
these students. 


Special attention should be given to the 
vitalization of Geography and Science in the 
large eastern schools in order partially to make 
up for the lack of actual experience in these 
fields. 







AN EXTENSION OF THE KELLEY-WOOD AND
KONDO-ELDERTON TABLES OF ABSCISSAE OF THE UNIT
NORMAL CURVE, FOR AREAS (½α) BETWEEN
.4500 AND .49999 99999


HERBERT S. CONRAD AND RUTH H. KRAUSE*
University of California


I. INTRODUCTION


If 2500 entering college freshmen take a
certain intelligence test, the proportion of
cases with scores below the top student's is
.9996; the proportion of cases with scores
between the top student's and the median is
.4996. In a normal curve, what is the numerical
value of the abscissa (x/σ) which divides
the top freshman from the 2499 students
below him? This seems a very reasonable
question; yet the answer is not immediately
available (with exactitude) from any
existing table of the normal curve. Neither
the Kelley-Wood (7) nor the Kondo-Elderton
(8) table supplies values of x/σ for areas
above ½α* = .4990; and inverse interpolation
into such tables as Sheppard's (11, 15),
Burgess' (2), or Krause and Conrad's (9)†,
while possible, is for many users of statistics
a somewhat unfamiliar and irksome solution.
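
For a modern reader, the question posed above can be checked numerically. The sketch below uses the inverse normal distribution function in Python's standard library, not the authors' tabular method; the variable names are ours.

```python
from statistics import NormalDist

n_students = 2500

# Proportion of cases below the top student, and the area between
# the median and the top student's score.
proportion_below_top = (n_students - 1) / n_students   # .9996
half_alpha = proportion_below_top - 0.5                # .4996

# Abscissa x/sigma of the unit normal curve for that area.
abscissa = NormalDist().inv_cdf(0.5 + half_alpha)
print(f"x/sigma = {abscissa:.6f}")
```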


II. DESCRIPTION OF THE TABLE 


Table 1 was prepared by the writers as a
supplement to the Kelley-Wood and the
Kondo-Elderton tables. Six-decimal values
are given of the abscissae, x/σ, corresponding

* For criticism of the manuscript and valuable suggestions,
the writers are indebted to Professors S. H. Levy, R. T. Crawford,
and A. H. Mowbray, of the University of California.
Acknowledgment is also made of correspondence with the late
Professor Karl Pearson and with Professor T. L. Kelley. In
the computation of Table 1, we were assisted chiefly by Mr.
L. Chan, Miss C. B. Hess, and Mrs. L. H. Aylesworth. Final
responsibility, of course, rests with the authors, both as to
methods employed and results obtained.

* The area under the unit normal curve, between the mean
(or median) and a given abscissa, has been designated by
Sheppard (15) and Pearson (11, p. xviii) as ½α (see Figure
1). In the Kelley-Wood table, this particular area has, by
Kelley (7, p. 97), been designated by the symbol J.

† We should mention here also the table by Glaisher (5), of
use for values of ½α above .49998. In both Burgess' and
Glaisher's tables, the argument is t (t = x/(σ√2)), rather
than x/σ (Sheppard) or x/PE (Krause and Conrad).


to the areas ½α = .4500 through ½α =
.4999, and ½α = .49991 through ½α =
.49999; then five-decimal values are given of
the abscissae corresponding to the areas
½α = .49999 1 through .49999 9
½α = .49999 91 through .49999 99
½α = .49999 991 through .49999 999
½α = .49999 9991 through .49999 9999;
and finally, four-decimal values are given for
the areas ½α = .49999 99991 through
.49999 99999.


























Figure 1. Illustrating areas under the normal curve


The area with which Table 1 begins is such that
only 5.00 per cent of the cases in a normal frequency
distribution lie above (or, alternatively, below) this
area (cf. Figure 1). When finding the abscissa
corresponding to an area above .9500 or below .0500
(i.e., ½α > .4500), it is frequently desirable not to
drop the last decimal, since the last (fourth) decimal
of the area affects the value of the abscissa in
the third decimal or earlier. But to take accurate
account of the fourth decimal of a large area, when
reading the Kelley-Wood or Kondo-Elderton table,
would involve interpolation requiring fourth-order
or even higher differences. Table 1 is designed to
eliminate the need for such labor. Above ½α =
.4999, decimals beyond the fourth must obviously
be taken into consideration; the values of x/σ
corresponding to such higher values of ½α are given
in Table 1 up to the area ½α = .49999 99999. In
a normal curve, only one case out of ten billion lies
above (or, alternatively, below) the abscissa
corresponding to this terminal area of Table 1.
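
Both remarks in the preceding paragraph can be illustrated with a short computation: the effect of the fourth decimal of a large area, and the abscissa at the terminal area. This is a sketch using Python's standard library rather than the tables discussed; the helper names are ours, and extreme areas are passed as tail proportions (½ − ½α) to avoid losing digits in floating-point arithmetic.

```python
from statistics import NormalDist

unit_normal = NormalDist()

def abscissa_from_half_alpha(half_alpha):
    """x/sigma for a given area between the mean and the abscissa."""
    return unit_normal.inv_cdf(0.5 + half_alpha)

def abscissa_from_tail(tail):
    """x/sigma for a given upper-tail proportion (0.5 minus half-alpha)."""
    return -unit_normal.inv_cdf(tail)

# One unit in the fourth decimal of the area moves the abscissa
# in the second decimal when the area is large.
step = abscissa_from_half_alpha(0.4991) - abscissa_from_half_alpha(0.4990)
print(f"change in x/sigma from .4990 to .4991: {step:.6f}")

# Terminal area of Table 1: one case in ten billion lies beyond it.
terminal = abscissa_from_tail(1e-10)
print(f"x/sigma at half-alpha = .49999 99999: {terminal:.4f}")
```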













TABLE 1

VALUES OF x/σ CORRESPONDING TO NORMAL CURVE AREAS (½α) ABOVE .4500

[Entries for ½α = .4500 through .4799 are not legible in this copy.]

To be read: The value of the abscissa (x/σ), when the area (½α) equals .450, is 1.644854.







TABLE 1 (Continued)

VALUES OF x/σ CORRESPONDING TO NORMAL CURVE AREAS (½α) ABOVE .4500

[Entries for ½α = .4800 through .49999 99999 are not legible in this copy.]

To be read: The value of the abscissa (x/σ), when the area (½α) equals .480, is 2.053749.
















A minus sign has been affixed to every entry
in Table 1 ending in a "raised" 5 (or a
"raised" 50 or 500). Thus, when ½α =
.4831, the value of x/σ is given in Table 1 as
2.122450−; if this value of x/σ is required to
four decimals only, the four-decimal value
should be written 2.1224; the five-decimal
value is 2.12245−.
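
The rounding convention just described can be expressed as a small routine. The sketch below is our own illustration, not part of the original procedure: the function name `shorten` and the use of a trailing hyphen to stand for the printed minus sign are assumptions.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

def shorten(entry, places):
    """Round a Table 1 entry to fewer decimals, honouring a trailing minus flag.

    A flagged entry such as "2.122450-" was rounded up to its final 5 (or 50,
    500), so the unrounded value lies just below the printed digits.
    """
    flagged = entry.endswith("-")
    printed = Decimal(entry.rstrip("-"))
    quantum = Decimal(1).scaleb(-places)

    value = printed
    if flagged:
        # Step down by half a unit of the last printed place before rounding.
        last_place = printed.as_tuple().exponent
        value = printed - Decimal(5).scaleb(last_place - 1)

    rounded = value.quantize(quantum, rounding=ROUND_HALF_UP)

    # The flag is carried over only when nothing but zeros was dropped,
    # i.e. the shortened entry still ends in the same "raised" 5.
    only_zeros_dropped = printed.quantize(quantum, rounding=ROUND_DOWN) == printed
    return f"{rounded}-" if flagged and only_zeros_dropped else str(rounded)

print(shorten("2.122450-", 4))
print(shorten("2.122450-", 5))
```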


III. ACCURACY OF THE TABLE 


Rigorous precautions have been taken to 
assure the accuracy of the entries in Table 1. 
All computations were carried at least two 
decimals beyond the number given in the final 
table. All computations were performed 
twice, the second computation being per- 
formed independently of the first, and almost 
always by different individuals with different 
calculating machines. Special computations 
were carried out for any value of x/σ in which
there appeared to be possibility of error in 
the last retained decimal; the additional com- 
putations in every case confirmed the original 
(cf. Section VII). Finally, a difference-table 
was constructed of the results. By these tech- 
niques, even so small an error as 1 in the last 
decimal should be improbable. 


IV. USE OF THE TABLE 


The abscissa corresponding to any four-decimal
value of ½α between .4500 and .4999
may be read directly from Table 1 correct to
six decimals. Beyond ½α = .4999, Table 1
is the most convenient source at present available
for values of x/σ corresponding to the given
values of ½α. Accurate values of x/σ,
corresponding to very high, accurately known
values of ½α, cannot be read directly from
Table 1. Suppose, for example, that it is required
to find the abscissa corresponding to
such an area as ½α = .49999 99976. One
might drop the last figure of this area, and,
reading directly the abscissa corresponding to
½α = .49999 9998, obtain the answer
5.88419 σ. This answer, however, is probably
not correct even to the second decimal; so it
would be desirable to retain the last figure of
.49999 99976, and employ some type of
interpolation. For rough work, ordinary linear
interpolation would be sufficient. The answer
by linear interpolation may be tested and refined
by the use of Glaisher's table (cf. Section
VII, 4, below), or (for very great accuracy)
by the use of Schlömilch's formula
(Section VII, 5, below). A solution of intermediate
accuracy (correct to three or four
decimals) may be obtained by preparing a
carefully drawn, large-scale graph of values of
x/σ in the region of ½α = .49999 99976.
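
The example of ½α = .49999 99976 can be worked through numerically. In the sketch below, the two bracketing five-decimal "table entries" are generated with Python's inverse normal function as stand-ins for the printed values, and the "exact" answer is likewise a library computation; neither is the authors' own figure. Areas are handled as tail proportions (½ − ½α) to preserve precision.

```python
from statistics import NormalDist

unit_normal = NormalDist()

def abscissa_from_tail(tail):
    """x/sigma for a given upper-tail proportion (0.5 minus half-alpha)."""
    return -unit_normal.inv_cdf(tail)

target_tail = 2.4e-9                  # half-alpha = .49999 99976
lower_tail, upper_tail = 3e-9, 2e-9   # half-alpha = .49999 9997 and .49999 9998

# Stand-ins for the five-decimal tabular entries that bracket the target.
x_lower = round(abscissa_from_tail(lower_tail), 5)
x_upper = round(abscissa_from_tail(upper_tail), 5)

# Ordinary linear interpolation on the area.
fraction = (lower_tail - target_tail) / (lower_tail - upper_tail)
linear = x_lower + fraction * (x_upper - x_lower)

exact = abscissa_from_tail(target_tail)

print(f"nearest entry (area rounded): {x_upper:.5f}")
print(f"linear interpolation:         {linear:.5f}")
print(f"direct computation:           {exact:.5f}")
```

In this sketch the rounded-area reading misses in the second decimal, while linear interpolation is off only in the third, which agrees with the description of linear interpolation as adequate for rough work.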
Co 


To convert the values of x/σ in Table 1 into
PE or "probable error" units, it is necessary
merely to multiply each entry in Table 1 by
1/.67448975, or 1.48260222 (2, p. 279).
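
The conversion factor can be verified directly, since the probable error is the abscissa that cuts off an area of .25 on either side of the mean. The following is a brief check using Python's standard library; the names are ours.

```python
from statistics import NormalDist

# Probable error of the unit normal curve: abscissa enclosing the middle 50 per cent.
probable_error = NormalDist().inv_cdf(0.75)
pe_factor = 1 / probable_error

def to_pe_units(abscissa_in_sigma):
    """Convert an abscissa from sigma units to PE units."""
    return abscissa_in_sigma * pe_factor

print(f"PE = {probable_error:.8f} sigma, factor = {pe_factor:.8f}")
print(f"1.644854 sigma = {to_pe_units(1.644854):.6f} PE")
```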


In practical statistical work, the values of
x/σ given in the last few lines of Table 1
(corresponding to very high values of ½α) may
seldom be called into service. In theoretical
work, on the other hand, it is possible that
values even beyond those given in Table 1 will
occasionally be desired.


V. PREVIOUS TABLES


Table 2 describes the four previously published
tables of abscissae corresponding to
areas under the unit normal curve. None of
these tables extends beyond the area ½α =
.499; and each proceeds by a uniform interval
(.01 or .001, respectively), which, for
higher values of ½α, may be considered
rather large.


VI. DISCUSSION


Tables of the area under the unit normal 
curve, corresponding to a given deviation from 
the mean, have been published by many writ- 
ers (4, 9, 19). The inverse type of table 
(such as Table 1), giving the abscissa for a 


Some interesting discrepancies occur in the values
of x/σ given in the various tables. According to the
Kondo-Elderton table (and our own computations),
the Kelley-Wood values of x/σ are too large by 1
in the sixth decimal at ½α = .456, .463, and .493,
and too large by 2 at ½α = .489; they are too small
by 1 in the sixth decimal at ½α = .487 and .498,
and too small by 3 at ½α = .499. In a footnote
(7, p. 97), Kelley has given the value of the abscissa
for ½α = .499 as 3.09022 850σ; according to the
Kondo-Elderton table, and our own computation, the
abscissa at this point (correct to ten decimals) equals
3.09023 23062σ. (The comparisons in this paragraph
have been limited to abscissae for values of
½α > .450.)





ee ee 
eter Sr se _« 

















282 JOURNAL OF EXPERIMENTAL EDUCATION 


[Vol. 5, No. 3 


TABLE 2

PREVIOUS TABLES OF ABSCISSAE (x/σ) CORRESPONDING TO GIVEN AREAS UNDER THE
UNIT NORMAL CURVE


                                              ARGUMENT            Range of       No. of
Author(s)                        Date    Area        Interval     areas, in      decimals in
                                                                  terms of ½α    abscissae

Sheppard (15)*                   1903    α           .01          .00-.40        7
Sheppard (16)*                   1907    Permilles   .001         .000-.499      4
Kelley, Wood, and Kelley (7)     1924    ½α          .001         .000-.499      6
Kondo and Elderton (8)†          1931    ½(1+α)      .001         .000-.499      10


* Reprinted in Tables for Statisticians and Biometricians, Part I (11).

† Reprinted in Tables for Statisticians and Biometricians, Part II (12).


given area, states the extremity of the event 
corresponding to the given probability. The 
task of preparing this inverse table has at- 
tracted much less attention than the first. 
One reason for this is the greater general util- 
ity of the first type of table; another is the 
difficulty of calculation of the inverse table; 
a third is the fact that, once the first table is 
available, it can be used (although less con- 
veniently) for the purposes that are directly 
served by the inverse. Table 1, therefore, 
like the Kelley-Wood and the Kondo—Elder- 
ton table, is simply a convenient supplement 
to such extended tables as those by Glaisher 
(5), Burgess (2), Sheppard (11, 15), and 
Krause and Conrad (9); with the further
qualification that Table 1 aims merely to
supplement the Kelley-Wood and Kondo-Elderton
tables, for the higher values of ½α (i.e.,
½α > .4500).


VII. CONSTRUCTION OF THE TABLE 


(1) Interpolation into the Kondo-Elderton table.
The Kondo-Elderton table presents ten-decimal values
of x/σ corresponding to values of ½α between
.000 and .499; interpolation into this table gave the
414 values of x/σ corresponding to values of ½α
between .4500 and .4913, inclusive. The interpolation
was completed in two steps: first, interpolation
into fifths (giving values of x/σ for ½α = .4502,
.4504, .4506, .4508, etc.), then interpolation into the
middle (giving values of x/σ for ½α = .4501, .4503,
.4505, .4507, etc.). Bessel's central difference formula
(13, p. 67) was employed, with differences as high
as the seventh order, when required.*
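
The first step, interpolation into fifths, can be sketched as follows. Two substitutions are made and should be borne in mind: the ten-decimal entries are generated from Python's inverse normal function as stand-ins for the Kondo-Elderton values, and Lagrange's polynomial through eight tabular points (equivalent in degree to carrying seventh-order differences) is used in place of Bessel's central difference formula. All names are ours.

```python
from statistics import NormalDist

unit_normal = NormalDist()

def table_entry(half_alpha):
    """Stand-in for a ten-decimal tabular value of x/sigma."""
    return round(unit_normal.inv_cdf(0.5 + half_alpha), 10)

def lagrange(xs, ys, x):
    """Value at x of the polynomial through the points (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Eight tabular arguments at interval .001, centred on the interval .450 to .451.
nodes = [round(0.447 + 0.001 * k, 3) for k in range(8)]
values = [table_entry(a) for a in nodes]

# Interpolation into fifths: half-alpha = .4502, .4504, .4506, .4508.
fifths = [round(0.450 + 0.0002 * k, 4) for k in range(1, 5)]
interpolated = {a: lagrange(nodes, values, a) for a in fifths}

for a, x in interpolated.items():
    direct = unit_normal.inv_cdf(0.5 + a)
    print(f"half-alpha = {a:.4f}  interpolated = {x:.9f}  direct = {direct:.9f}")
```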

*Ten decimals were retained for the interpolation into 
fifths; for interpolation into the middle, eight decimals only 
were used, since the values obtained by interpolation into 


fifths were not sufficiently accurate to justify ten-decimal 
computation. 


For the 67 values of ½α between .4914 and .4980,
inclusive, the interpolation by Bessel's formula was
supplemented (a) by backward interpolation using
Newton's formula (13, p. 45), and (b) by inverse
interpolation into Sheppard's table* (11, 15), using
third-order divided differences (17, p. 22, formula 1).
Since the results from these methods were not always
in agreement to the desired number of decimals
(particularly for the higher values of ½α), they were
regarded as approximations, to be checked and
improved by the technique of "synthetic division" (see
Section (2) below).


As a check on values of x/σ occasionally doubtful
in the last decimal (whether obtained by interpolation
or synthetic division), interpolation into Burgess'
extensive table was next applied (see Section (3)
below). Doubtful values by interpolation into the
Kondo-Elderton table included those in which the
sixth decimal was uncertain, due to the closeness of
succeeding decimals to "50"*; doubtful values of x/σ
by synthetic division were those which yielded
remainder-terms of opposite sign but nearly equal
magnitude (see (2) below). The check by interpolation
into Burgess' table in every instance confirmed the
original computation.


(2) Synthetic division. The area under the unit 
normal curve is expressible (7, p. 96) by the infinite 
series 


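$$\tfrac{1}{2}\alpha N = \frac{N}{\sqrt{\pi}}\left[\frac{x}{\sigma\sqrt{2}} - \frac{1}{3\cdot 1!}\left(\frac{x}{\sigma\sqrt{2}}\right)^{3} + \frac{1}{5\cdot 2!}\left(\frac{x}{\sigma\sqrt{2}}\right)^{5} - \frac{1}{7\cdot 3!}\left(\frac{x}{\sigma\sqrt{2}}\right)^{7} + \cdots\right] \qquad (1)$$
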
* Thus, at ½α = .4573, interpolation by Bessel's formula
gave x/σ = 1.720178499, so that the value of x/σ (to six
decimals) is 1.720178. But if the value 1.720178499 were too
small by as little as .000000001, the value of x/σ (to six
decimals) would become 1.720179.


* Sheppard's table gives seven-decimal values of the area
½(1 + α) corresponding to two-decimal values of the abscissa,
x/σ. The values of ½(1 + α) are converted to values of
½α, by subtracting .5 from each of Sheppard's entries.










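From this (assuming a unit normal curve, in which
N = 1 and σ = 1) we may readily obtain

$$\tfrac{1}{2}\alpha\,\sqrt{2\pi} = x - \frac{x^{3}}{3\cdot 1!\cdot 2} + \frac{x^{5}}{5\cdot 2!\cdot 2^{2}} - \frac{x^{7}}{7\cdot 3!\cdot 2^{3}} + \cdots \qquad (2)$$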


which is virtually the same as the formula given by
Burgess.† If x is not too large, a close approximation
to the numerical limit of series (2) may be calculated
by the familiar method of synthetic division (6, pp.
90-93). The values of x/σ corresponding to values
of ½α between .4914 and .4980 were determined by
this method.* Difficulty arose only when, in the
synthetic divisions, two adjacent trial-values of x
yielded remainder-terms of opposite sign but nearly
equal magnitude. In such a case, as stated in section
(1) above, the method of synthetic division was
supplemented by interpolation into Burgess' table (see
below).
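
The inversion of the series can be sketched in code. The coefficients below follow the pattern 1/((2n+1)·n!·2^n) with alternating signs, which agrees with the values listed in Table 3. In place of synthetic division, the sketch finds the root by simple bisection between trial values of x; this is a substitution made for brevity, not the authors' procedure, and the function names are ours.

```python
import math
from statistics import NormalDist

def coefficient(n):
    """Absolute coefficient of x^(2n+1) in the series for half-alpha * sqrt(2 pi)."""
    return 1 / ((2 * n + 1) * math.factorial(n) * 2 ** n)

def series_half_alpha(x, terms=40):
    """Area between the mean and x, summed from the alternating power series."""
    total = 0.0
    for n in range(terms):
        total += (-1) ** n * coefficient(n) * x ** (2 * n + 1)
    return total / math.sqrt(2 * math.pi)

def abscissa(half_alpha, low=0.0, high=3.0, tolerance=1e-12):
    """Solve series_half_alpha(x) = half_alpha by bisection (stand-in for
    synthetic division); valid while the root lies between low and high."""
    while high - low > tolerance:
        middle = (low + high) / 2
        if series_half_alpha(middle) < half_alpha:
            low = middle
        else:
            high = middle
    return (low + high) / 2

x = abscissa(0.4950)
print(f"x/sigma for half-alpha = .4950: {x:.6f}")
print(f"library value:                  {NormalDist().inv_cdf(0.9950):.6f}")
```

With forty terms the highest power retained is the seventy-ninth, more than the fifty-seventh reported in the footnote; the surplus terms are negligible for x below 3 and are kept only so that the sketch needs no stopping rule.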

Interpolation into Burgess’ table would have been 
more efficient than synthetic division as a means of 
checking and improving the approximations men- 
tioned in section (1). We employed synthetic divi- 
sion, first because Kelley (7, p. 96) mentions his use 


† Burgess' formula 9 (2, p. 262) gives the area α, instead of
½α, as a function of t, instead of x/σ. (The relation between
t and x/σ is given by t = x/(σ√2).)

* The coefficients of x in series (2) were used correct to ten 
significant figures. The number of terms to be retained in the 
series was decided by the rather arbitrary rule to include 
enough terms that, when the highest power of x was multi- 
plied by its coefficient, the product should be negligible to ten 
decimals. (In evaluating the highest power of x for this 
multiplication, it was, of course, necessary to use not the cor- 
rect (as yet unknown) value of x, but the closest available 
approximation.) The highest power of x employed was the 
fifty-seventh. Since the decimal values of the coefficients of x 
in series (2) may be of some general interest, they have been 
preserved in Table 3. The value of √(2π), correct to fifteen
decimals, is given in reference 3, p. 207.


of series (1) for the determination of values of x/σ,
and second because we were, at the time, unaware
of the existence of Burgess' table.

(3) Interpolation into Burgess' table. In the region
of Burgess' table employed by us, fifteen-decimal
values are given of the area α, corresponding to values
of t. Dividing each of Burgess' entries by 2
gives a table of ½α, applicable for our present purposes.

For the 19 values of ½α between .4981 and .4999,
and the 8 values of ½α between .49991 and .49998,
approximate values of t (and thence of x/σ*) were
obtained by linear inverse interpolation into Burgess'
table. The use of Burgess' table to check and improve
the accuracy of these approximations may be
illustrated by the determination of the six-decimal
value of x/σ for ½α = .49997. By linear inverse
interpolation, the value of t corresponding to the area
.49997 is 2.8374879....; this, in terms of σ, equals
4.01281 (to five decimals*). The value of t
corresponding to 4.01281 σ is 2.83748 51626 132. By
interpolation into Burgess' table,† we find that the area
(½α) corresponding to t = 2.83748 51626 132 is
.49996 99998 96886; this falls short of the desired
area of .49997 by .00000 00001 03114, indicating
that the trial value, x/σ = 4.01281, is somewhat too
small. This fact was verified by finding the value
of ½α corresponding to the abscissa x/σ = 4.01282
(t = 2.83749 22336 810): this area equals .49997
00011 68211, which exceeds the desired area by .00000
00011 68211. Evidently the correct value of x/σ is
between 4.01281 and 4.01282. We try now a six-decimal
value of x/σ, estimated on the basis of the
two discrepancies from .49997 just found for the
five-decimal values; this estimate is x/σ = 4.012811,*
or t = 2.83748 58697 200. By interpolation into
Burgess' table, the area corresponding to t = 2.83748
58697 200 is .49997 00000 24020: this exceeds the
desired area of .49997 by .00000 00000 24020. The
area corresponding to the abscissa 4.012810 σ (t =
2.83748 51626 132) is .49996 99998 96886: this falls
short of the desired area of .49997 by .00000 00001
03114. Since the area obtained for 4.012811 σ yields
the smaller discrepancy from the desired area, .49997,
the value 4.012811 σ is accepted as the correct
six-decimal value corresponding to the area ½α =
.49997.

* Values of t were converted to values of x/σ by multiplication
by √2 (= 1.41421 35623 73095); values of x/σ were
converted to values of t by multiplication by 1/√2 (= .70710
67811 86548) (the numerical values of √2 and its reciprocal
are given in reference 12, p. 262).

* Linear inverse interpolation did not supply six-decimal
estimates of values of x/σ with sufficient accuracy to permit
efficient use

† The interpolation was performed by Bessel's formula (13,
p. 67) using m to ten decimals: differences through the third
were sufficient for the accuracy desired. Ten significant figures
or fifteen decimals were used for the first difference; the
second and third differences were taken to fifteen decimals.


TABLE 3

COEFFICIENTS OF x IN THE EXPANSION OF THE INTEGRAL OF THE NORMAL CURVE

Power                                   Power
of x    Coefficient                     of x    Coefficient

  1     1.00000 00000                    37     0.16103 39084 (10^-22)
  3     0.16666 66667                    39      .40204 14718 (10^-24)
  5      .25000 00000 (10^-1)            41      .95607 42316 (10^-26)
  7      .29761 90476 (10^-2)            43      .21704 89673 (10^-27)
  9      .28935 18519 (10^-3)            45      .47136 89694 (10^-29)
 11      .23674 24242 (10^-4)            47      .98111 02509 (10^-31)
 13      .16693 37607 (10^-5)            49      .19605 51947 (10^-32)
 15      .10333 99471 (10^-6)            51      .37673 35114 (10^-34)
 17      .56988 94141 (10^-8)            53      .69714 83701 (10^-36)
 19      .28327 83637 (10^-9)            55      .12440 69482 (10^-37)
 21      .12814 97360 (10^-10)           57      .21436 03431 (10^-39)
 23      .53184 67303 (10^-12)           59      .35705 84323 (10^-41)
 25      .20387 45800 (10^-13)           61      .57558 59975 (10^-43)
 27      .72604 90739 (10^-15)           63      .89889 26228 (10^-45)
 29      .24142 02586 (10^-16)           65      .13613 03732 (10^-46)
 31      .75281 58601 (10^-18)           67      .20010 11817 (10^-48)
 33      .22099 70801 (10^-19)           69      .28573 69816 (10^-50)
 35     0.61284 90457 (10^-21)           71     0.39669 72179 (10^-52)
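
The checking procedure for ½α = .49997 can be retraced numerically. In the sketch below the complementary error function of Python's standard library stands in for interpolation into Burgess' table (with t = x/√2), and the six-decimal estimate is formed by proportional parts from the two discrepancies, as in the text. The function and variable names are ours.

```python
import math

def half_alpha(x):
    """Area between the mean and x/sigma = x, via erfc at t = x / sqrt(2)."""
    t = x / math.sqrt(2)
    return 0.5 - 0.5 * math.erfc(t)

target = 0.49997
low, high = 4.01281, 4.01282      # adjacent five-decimal trial values

shortfall = target - half_alpha(low)    # trial value too small
excess = half_alpha(high) - target      # trial value too large

# Proportional parts between the two trial values.
estimate = low + shortfall / (shortfall + excess) * (high - low)

print(f"shortfall at 4.01281: {shortfall:.3e}")
print(f"excess at 4.01282:    {excess:.3e}")
print(f"six-decimal estimate: {estimate:.6f}")
```

The self-checking character of the method shows in the signs: a positive shortfall and a positive excess confirm that the root lies between the two trial values before any estimate is accepted.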


The technique described above may appear more 
cumbrous than inverse interpolation using higher- 
order differences (17, p. 79). It has, however, the 
advantage of providing an immediate self-check. 
Such a check is especially desirable when (as in the 
terminal section of Table 1) a difference-table of re- 
sults fails to provide a thoroughly trustworthy indi- 
cation of accuracy. 

(4) Interpolation into Glaisher's table. Beginning with $t = 3.0$, the
interval of the argument in Burgess' table jumps from .002 to .1. For
this reason, Glaisher's table (5) of the Error-function,

\[ \int_t^{\infty} e^{-t^2}\,dt, \]

extending from $t = 3.00$ to $t = 4.50$, becomes better adapted to our
purpose, and was used for values of $\tfrac{1}{2}\alpha$ from .49999 up.
Glaisher's table was first converted to a table of $\tfrac{1}{2}\alpha$
(with argument $t$) by multiplying each of Glaisher's entries by
$1/\sqrt{\pi}$ (= .56418 95835*), and subtracting the product from .5.

Glaisher's table (i.e., the converted table of $\tfrac{1}{2}\alpha$) was
used in essentially the same manner as Burgess' table (see section (3)
above). The entries in Glaisher's table, however, are given to fewer
decimals, and are also probably less accurate,* than in Burgess'. In
consequence, it occasionally appeared more or less uncertain which of
two adjacent values of $x/\sigma$ was the correct one, since the two
values led to areas which differed from the desired value of
$\tfrac{1}{2}\alpha$ in almost equal amount (to the number of decimals
available).† In such a case, a more accurate determination of the areas
corresponding to the two values of $x/\sigma$ was obtained by means of
Schlömilch's formula (see section (5) below). The computation by
Schlömilch's formula in every case confirmed the value of $x/\sigma$
indicated as correct by the interpolation into Glaisher's table.

* Obtained by the computation,
\[ 4.01281 + \left(\frac{103114}{103114 + 1168211}\right)(.00001) = 4.012811. \]

* The value of $1/\sqrt{\pi}$ is given to twenty-three decimals by
Burgess (2, p. 279), and to fifteen decimals in Barlow's Tables (3, p. 207).

* Glaisher states that his table, in certain portions, "may be in error
to the extent of one or even two units" in the last place (5, p. 435).
For the 21 values between $t = 3.0$ and $t = 5.0$, Burgess (2, p. 321)
presents the values of Glaisher's Error-function to fifteen decimals.
These values by Burgess provide a comparison for every tenth value in
Glaisher's table. At $t = 3.6$, 3.7, and 3.8, the last decimal in
Glaisher's table differs, in each case, by 1 from the value in Burgess'
table (2).

† This situation did not arise in the interpolation into Burgess' table,
in part because of the greater number of decimals given by Burgess (2).
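The conversion of a Glaisher-type entry into the area $\tfrac{1}{2}\alpha$ can be illustrated as follows. This is a minimal sketch: `glaisher_entry` stands in for a printed table entry by calling `math.erfc`, which is an assumption of the example and not how the original table was produced, and both function names are hypothetical.

```python
import math
from statistics import NormalDist

INV_SQRT_PI = 1 / math.sqrt(math.pi)   # .56418 95835...

def glaisher_entry(t):
    """Integral of exp(-u*u) from t to infinity, i.e. (sqrt(pi)/2) * erfc(t)."""
    return 0.5 * math.sqrt(math.pi) * math.erfc(t)

def half_alpha_from_glaisher(entry):
    """Multiply the entry by 1/sqrt(pi) and subtract the product from .5."""
    return 0.5 - INV_SQRT_PI * entry

t = 3.0
x_over_sigma = t * math.sqrt(2)        # abscissa of the unit normal curve
area = half_alpha_from_glaisher(glaisher_entry(t))
direct = NormalDist().cdf(x_over_sigma) - 0.5
```

At $t = 3.0$ the converted area agrees with the area obtained directly from the normal distribution at the abscissa $t\sqrt{2}$.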

(5) Schlömilch's formula. For high values of $x/\sigma$, the area
$\tfrac{1}{2} - \tfrac{1}{2}\alpha$ (see Figure 1) may be calculated to a
greater degree of accuracy by Schlömilch's formula* than by
interpolation. Assuming a unit normal curve (in which both $N$ and
$\sigma$ equal 1), this formula, to the number of terms employed by us,
is written

\[
\frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-\frac{1}{2}x^2}\,dx
= \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}
\Big\{ \frac{1}{x} - \frac{1}{x(x^2+2)} + \frac{1}{x(x^2+2)(x^2+4)}
- \frac{5}{x(x^2+2)(x^2+4)(x^2+6)}
+ \frac{9}{x(x^2+2)\cdots(x^2+8)}
- \frac{129}{x(x^2+2)\cdots(x^2+10)}
+ \frac{57}{x(x^2+2)\cdots(x^2+12)} \Big\}.
\]

As stated in section (4) above, resort was had to Schlömilch's formula
when results by interpolation into Glaisher's table were not entirely
trustworthy to the number of decimals desired. The results by
application of Schlömilch's formula are converted to values of
$\tfrac{1}{2}\alpha$ merely by subtraction from .5.
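A seven-term evaluation of Schlömilch's formula might look like the sketch below. The numerators 1, 1, 1, 5, 9, 129, 57 are an assumption of this example: they are the values that make the expansion agree with the known series for the normal tail, and the comparison against `math.erfc` is included precisely so that this assumption can be checked. The function name is hypothetical.

```python
import math

# Numerators of the successive terms inside the braces; signs alternate.
NUMERATORS = [1, 1, 1, 5, 9, 129, 57]

def schlomilch_tail(x):
    """Upper-tail area 1/2 - 1/2*alpha of the unit normal curve, seven terms."""
    y = x * x
    total = 0.0
    denom = 1.0
    for k, num in enumerate(NUMERATORS):
        if k > 0:
            denom *= y + 2 * k          # builds (x^2+2)(x^2+4)...(x^2+2k)
        total += (-1) ** k * num / denom
    return math.exp(-0.5 * y) / (x * math.sqrt(2 * math.pi)) * total

x = 4.012811
tail = schlomilch_tail(x)
half_alpha = 0.5 - tail                  # converted by subtraction from .5
reference = 0.5 * math.erfc(x / math.sqrt(2))
```

For the abscissa 4.012811 the resulting $\tfrac{1}{2}\alpha$ falls very close to .49997, which is the agreement the authors rely on when the tables leave the last decimal in doubt.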


* "Schlömilch's formula" is given by Pearson, to six terms (10, p. 15);
the denominators of additional terms may be written by inspection, but
the numerators must be computed by the formulas given by Schlömilch
(14, p. 266). Our calculation of the numerator of the seventh term
agrees with that given by Pearson in another connection (12, p. xxxviii).

For accurate results by Schlömilch's formula, it is desirable to use an
extensive table of logarithms (such as that by Bauschinger and Peters
(1)), and to refine the logarithms and antilogarithms of the table
either by interpolation or (more efficiently) by the factor method
(18, pp. viii-ix).


SUMMARY

A table is presented of the abscissae (values of $x/\sigma$)
corresponding to areas under the unit normal curve. The areas of the
table extend from $\tfrac{1}{2}\alpha = .4500$ through
$\tfrac{1}{2}\alpha = .49999\ 99999$; the abscissae extend from
$1.644854\,\sigma$ through $6.3613\,\sigma$, respectively. For the most
part, the values of $x/\sigma$ are given to six decimals; but for the
higher values of $\tfrac{1}{2}\alpha$ (beginning with .499991), only five
decimals are given, and for the last few values of $\tfrac{1}{2}\alpha$,
only four decimals.

The table will be found a convenient supplement to the Kelley-Wood and
the Kondo-Elderton tables. Both these tables proceed by the uniform
interval of .001; this interval, though adequate for lower values of
$\tfrac{1}{2}\alpha$, may for higher values be considered rather large
(especially if one prefers to use the tables directly, without
interpolation). Neither the Kelley-Wood nor the Kondo-Elderton table
extends beyond the area $\tfrac{1}{2}\alpha = .499$.

The entries in Table 1 have been calculated with great care, and checked
by supplementary techniques whenever there was any reason to doubt the
accuracy of the last retained decimal. Since the supplementary
calculation in every case confirmed the original, the table may be
considered correct to the number of decimals given.


BIBLIOGRAPHY

1. Bauschinger, J. and Peters, J. Logarithmic-Trigonometrical Tables.
   First Volume: Table of Logarithms to Eight Places of All Numbers from
   1 to 200,000. Leipzig: W. Engelmann, 1910.

2. Burgess, J. "On the Definite Integral
   $\frac{2}{\sqrt{\pi}}\int_0^t e^{-t^2}\,dt$, with Extended Tables of
   Values." Trans. Royal Soc. Edinburgh, 39: 257-321 (1896-1899).

3. Comrie, L. J. (ed.) Barlow's Tables (third edition). London: E. and
   F. N. Spon, 1935.

4. Deming, W. E. and Birge, R. T. "On the Statistical Theory of Errors."
   Reviews of Modern Physics, 6: 119-161 (1934).

5. Glaisher, J. W. L. "On a Class of Definite Integrals, Part II."
   Philos. Mag. and Jour. Sci., 42 (Fourth Series): 421-436 (1871,
   Part II).

6. Hawkes, H. E. Higher Algebra. Boston: Ginn and Co., 1913.

7. Kelley, T. L. Statistical Method. New York: Macmillan, 1924.

8. Kondo, T. and Elderton, E. M. "Table of Normal Curve Functions to
   Each Permille of Frequency." Biometrika, 22: 368-376 (1930-1931).

9. Krause, R. H. and Conrad, H. S. "A Seven-Decimal Table of the Area
   ($\alpha$) under the Unit Normal Curve, for Abscissae Expressed in
   Terms of P. E." Accepted for publication in the March, 1937 issue of
   Psychometrika.

10. Pearson, K. On a Novel Method of Regarding the Association of Two
    Variates Classed Solely in Alternate Categories. Draper's Co.
    Research Memoirs: Biometric Series, VII. London: Dulau and Co., 1912.

11. Pearson, K. (ed.). Tables for Statisticians and Biometricians.
    Part I (second edition). London: Biometric Laboratory, University
    College, 1924.

12. Pearson, K. (ed.). Tables for Statisticians and Biometricians.
    Part II. London: Biometric Laboratory, University College, 1931.

13. Rice, H. L. The Theory and Practice of Interpolation. Lynn, Mass.:
    Nichols Press, 1899.

14. Schlömilch, O. Compendium der Höheren Analysis (second edition),
    vol. 2. Braunschweig: Friedrich Vieweg und Sohn, 1874.

15. Sheppard, W. F. "New Tables of the Probability Integral."
    Biometrika, 2: 174-190 (1902-1903).

16. Sheppard, W. F. "Table of Deviates of the Normal Curve." In "Grades
    and Deviates," by Francis Galton. Biometrika, 5: 400-406 (1906-1907).

17. Steffensen, J. F. Interpolation. Baltimore: Williams and Wilkins,
    1927.

18. Thompson, A. J. Logarithmetica Britannica, Part IX (Tracts for
    Computers, No. XI). London: Biometric Laboratory, University of
    London, 1924.

19. Walker, H. M. Studies in the History of Statistical Method.
    Baltimore: Williams and Wilkins, 1929.


COMBINATIVE PROPERTIES OF CORRELATION COEFFICIENTS

JACK W. DUNLAP
Fordham University


In analyzing data it is often advisable to separate the material into
certain natural groups, such as age, grade, sex, school, etc., and
compute the desired statistics for each group. In many cases, however,
it is necessary in addition to secure the same statistics for various
combinations of these subgroups, and this necessitates further
computation. This computation is particularly annoying in the case of
the correlation coefficient. Again, in examining published material,
one often wishes to combine data for various groups. Sometimes data from
a number of sources are available, such as data on the correlation
between two tests (say the Henmon-Nelson Intelligence Test and the
Columbia Research Bureau Algebra Test) from the published reports of
several different workers. The problem then is how to combine these
correlations into a single value such as would be found if all the
original data could be put on a single scatter diagram. This paper
presents a method for combining correlation coefficients when all the
subgroup means, standard deviations, and correlations are available.

The method of combining means from several sets of data is well known,
but is given again here in order to present a complete development of
the equations for combining correlations.

\[ M_x = \frac{\sum^{N} X}{N}, \tag{1} \]

where $M_x$ is the mean of all X-values, $N$ the total number of cases
(composed of two groups $m$ and $n$, so that $N = m + n$), and
$\sum^{N} X$ is the sum of the $N$ values of the variable $X$. Every
summation in the present paper is to be taken as proceeding from the
first value of the variable to the one indicated (the N'th, m'th, n'th,
or k'th), and in writing formulas the lower limit of the summation is
therefore omitted, but understood always as 1.

\[ \sum^{N} X = \sum^{m} X + \sum^{n} X. \tag{2} \]

But

\[ M_{x_m} = \frac{\sum^{m} X}{m}, \quad \text{and} \quad M_{x_n} = \frac{\sum^{n} X}{n}, \]

so that equation (1) may be rewritten as

\[ M_x = \frac{\sum X_m + \sum X_n}{m + n}, \tag{3} \]

where $\sum X_m$ refers to all scores in the first or m-group, and
$\sum X_n$ refers to all scores in the second or n-group. This
expression may be rewritten in terms of the means of the two groups as

\[ M_x = \frac{m M_{x_m} + n M_{x_n}}{m + n}, \]

and in general,

\[ M_x = \frac{m M_{x_m} + n M_{x_n} + \dots + k M_{x_k}}{m + n + \dots + k}. \tag{4} \]

The relationship between the standard deviations of the two groups and
that of the combined group, while not quite so simple, is also generally
known, but is repeated here for the sake of completeness.

\[ \sigma_x^2 = \frac{\sum^{N} x^2}{N}, \tag{5} \]

where $x$ is a score taken as a deviation from the mean, $M_x$, and $N$
is the total number of cases, as before. Consider $N$ again as composed
of two groups, $m$ and $n$, whose means and variances are

\[ M_{x_m} = \frac{\sum X_m}{m}, \quad \sigma^2_{x_m} = \frac{\sum x_m^2}{m}, \qquad
   M_{x_n} = \frac{\sum X_n}{n}, \quad \sigma^2_{x_n} = \frac{\sum x_n^2}{n}. \]
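The $x^2$-values in equation (5) come from the two groups $m$ and $n$, so

\[ \sigma_x^2 = \frac{\sum^{m} x^2 + \sum^{n} x^2}{m + n}. \tag{6} \]

Now, for any score in the m-group,

\[ x = X - M_x, \quad \text{and} \quad x_m = X - M_{x_m}. \]

Subtracting,

\[ x - x_m = M_{x_m} - M_x. \tag{7} \]

Now set

\[ M_{x_m} - M_x = \delta_m. \tag{8} \]

Then

\[ x = x_m + \delta_m. \tag{9} \]

Expressing the $\sum^{m} x^2$-values and the $\sum^{n} x^2$-values in
terms of deviations from their own means and the means of all the scores
by equation (9) and an analogous equation for the n-group, and
substituting in (6),

\[ \sigma_x^2 = \frac{\sum^{m} (x_m + \delta_m)^2 + \sum^{n} (x_n + \delta_n)^2}{m + n}, \]

which on expanding and reducing (the cross-product terms vanish because
$\sum^{m} x_m = 0$ and $\sum^{n} x_n = 0$) gives

\[ \sigma_x^2 = \frac{m(\sigma^2_{x_m} + \delta_m^2) + n(\sigma^2_{x_n} + \delta_n^2)}{m + n}, \tag{10} \]

and in general,

\[ \sigma_x^2 = \frac{m(\sigma^2_{x_m} + \delta_m^2) + n(\sigma^2_{x_n} + \delta_n^2) + \dots + k(\sigma^2_{x_k} + \delta_k^2)}{m + n + \dots + k}. \tag{11} \]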
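Equations (4) and (11) translate directly into a short routine. The sketch below is illustrative only: the function name and the tuple layout `(size, mean, variance)` are assumptions of this example, and the variances are taken with divisor $N$, as in equation (5).

```python
from statistics import fmean, pvariance

def combine_mean_variance(groups):
    """groups: list of (size, mean, variance), variances using divisor N."""
    total = sum(size for size, _, _ in groups)
    # Equation (4): weighted mean of the subgroup means.
    mean = sum(size * m for size, m, _ in groups) / total
    # Equation (11): each delta is the subgroup mean minus the total mean.
    variance = sum(size * (v + (m - mean) ** 2) for size, m, v in groups) / total
    return total, mean, variance

a = [2.0, 4.0, 6.0]
b = [1.0, 3.0, 5.0, 7.0, 9.0]
summary = [(len(g), fmean(g), pvariance(g)) for g in (a, b)]
n, mean, variance = combine_mean_variance(summary)
```

The combined values agree with the mean and variance computed from the pooled raw scores, which is the sense in which the relations are exact.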


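The combination of correlation coefficients is slightly more complex.

\[ r_{xy} = \frac{\sum^{N} xy}{N \sigma_x \sigma_y}. \tag{12} \]

Again consider the expression as composed of values from two groups,
$m$ and $n$. The numerator is the only expression for which new
combinative relations need to be derived. Let

\[ \sum^{N} xy = \sum^{m} xy + \sum^{n} xy, \]

where the limits $m$ and $n$ refer again to the groups to be combined.
Any x- or y-values in group $m$ can be expressed as deviations from the
means of the total group by the expressions,

\[ x = x_m + \delta_m, \quad y = y_m + \Delta_m, \]

and any x- or y-values in group $n$ can be expressed similarly,

\[ x = x_n + \delta_n, \quad y = y_n + \Delta_n, \]

where

\[ \delta_m = M_{x_m} - M_x, \quad \Delta_m = M_{y_m} - M_y, \]
\[ \delta_n = M_{x_n} - M_x, \quad \Delta_n = M_{y_n} - M_y. \]

Rewriting the product-moment expression in terms of the deviations from
the means of the subgroups and the general means by the use of these
relations gives,

\[ \sum^{N} xy = \sum^{m} (x_m + \delta_m)(y_m + \Delta_m) + \sum^{n} (x_n + \delta_n)(y_n + \Delta_n), \]

which on expanding, collecting, and reducing becomes,

\[ \sum^{N} xy = m \sigma_{x_m} \sigma_{y_m} r_{x_m y_m} + m \delta_m \Delta_m
 + n \sigma_{x_n} \sigma_{y_n} r_{x_n y_n} + n \delta_n \Delta_n. \]

The correlation for the combined groups may then be expressed in terms
of the correlations, standard deviations, and differences between the
subgroup means and the means of the total group, as

\[ r_{xy} = \frac{m \sigma_{x_m} \sigma_{y_m} r_{x_m y_m} + m \delta_m \Delta_m
 + n \sigma_{x_n} \sigma_{y_n} r_{x_n y_n} + n \delta_n \Delta_n}
 {\sqrt{m(\sigma^2_{x_m} + \delta_m^2) + n(\sigma^2_{x_n} + \delta_n^2)}\;
  \sqrt{m(\sigma^2_{y_m} + \Delta_m^2) + n(\sigma^2_{y_n} + \Delta_n^2)}}, \tag{13} \]

and in general,

\[ r_{xy} = \big[ m \sigma_{x_m} \sigma_{y_m} r_{x_m y_m} + n \sigma_{x_n} \sigma_{y_n} r_{x_n y_n} + \dots + k \sigma_{x_k} \sigma_{y_k} r_{x_k y_k}
 + m \delta_m \Delta_m + n \delta_n \Delta_n + \dots + k \delta_k \Delta_k \big] \]
\[ \div \big[ m(\sigma^2_{x_m} + \delta_m^2) + n(\sigma^2_{x_n} + \delta_n^2) + \dots + k(\sigma^2_{x_k} + \delta_k^2) \big]^{1/2}
 \big[ m(\sigma^2_{y_m} + \Delta_m^2) + n(\sigma^2_{y_n} + \Delta_n^2) + \dots + k(\sigma^2_{y_k} + \Delta_k^2) \big]^{1/2}. \tag{14} \]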







This formula gives an exact value for the 
combination of sets of correlations and is in 
no sense an approximation. By means of it 
the investigator can quickly and exactly de- 
termine what the correlation would be if the 
several groups of data had been combined in 
one original scatter diagram. 
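The exactness of formula (14) can be illustrated with a short sketch that combines subgroup summaries and compares the result with the correlation of the pooled raw scores. The function names, the dictionary keys, and the small data sets are all assumptions made for this illustration; standard deviations use divisor $N$, as in the paper.

```python
import math
from statistics import fmean, pstdev

def pearson_r(xs, ys):
    """Product-moment correlation computed directly from raw scores."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def summarize(xs, ys):
    """Subgroup size, means, divide-by-N standard deviations, correlation."""
    return {"n": len(xs), "mx": fmean(xs), "my": fmean(ys),
            "sx": pstdev(xs), "sy": pstdev(ys), "r": pearson_r(xs, ys)}

def combine_correlations(groups):
    """Formula (14): correlation of the mingled data from subgroup summaries."""
    total = sum(g["n"] for g in groups)
    mx = sum(g["n"] * g["mx"] for g in groups) / total
    my = sum(g["n"] * g["my"] for g in groups) / total
    num = var_x = var_y = 0.0
    for g in groups:
        dx, dy = g["mx"] - mx, g["my"] - my      # delta and Delta
        num += g["n"] * (g["sx"] * g["sy"] * g["r"] + dx * dy)
        var_x += g["n"] * (g["sx"] ** 2 + dx ** 2)
        var_y += g["n"] * (g["sy"] ** 2 + dy ** 2)
    return num / math.sqrt(var_x * var_y)

x1, y1 = [1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]
x2, y2 = [5.0, 6.0, 7.0, 8.0, 9.0], [3.0, 5.0, 4.0, 6.0, 8.0]
combined = combine_correlations([summarize(x1, y1), summarize(x2, y2)])
pooled = pearson_r(x1 + x2, y1 + y2)
```

Here `combined` and `pooled` agree to rounding error, whereas a simple average of the two subgroup correlations would in general give a different figure.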

If it is known or can be assumed that the groups are identical as to
size, and that all x-means and all y-means are equal, formula (14)
reduces to

\[ r_{xy} = \big[ \sigma_{x_m} \sigma_{y_m} r_{x_m y_m} + \sigma_{x_n} \sigma_{y_n} r_{x_n y_n} + \dots + \sigma_{x_k} \sigma_{y_k} r_{x_k y_k} \big]
 \div \big[ \sigma^2_{x_m} + \sigma^2_{x_n} + \dots + \sigma^2_{x_k} \big]^{1/2}
 \big[ \sigma^2_{y_m} + \sigma^2_{y_n} + \dots + \sigma^2_{y_k} \big]^{1/2}. \tag{15} \]

If in addition to the above restrictions we know or are able to assume
all x-standard deviations and all y-standard deviations are equal,
formula (15) may be simplified still further to,

\[ r_{xy} = \frac{r_{x_m y_m} + r_{x_n y_n} + \dots + r_{x_k y_k}}{k}, \tag{16} \]

where $k$ is the number of groups combined.

The common practice of averaging corre- 
lation coefficients makes all the assumptions 
inherent in formula (16), namely: (1) equal- 
ity of numbers in the groups combined, (2) 
equality of all x-means, (3) equality of all y- 
means, (4) equality of all x-standard devia- 
tions and (5) equality of all y-standard devia- 
tions. These assumptions, taken altogether, 
are equivalent to the assumption that we have 
matched groups. Obviously, in practice, the 
simultaneous fulfillment of all these condi- 
tions is seldom if ever met with except in the 
case of deliberate matching. Thus the prac- 
tice of getting a simple average of a series of 
correlations by means of formula (16) is 
rarely justified, and may lead to serious er- 
rors. In general, when combining correla- 
tions, formula (13) or formula (14) should 
be used.

Since the use of formulas is not restricted 
to statisticians, a few comments on the limi- 
tations of formulas (13) and (14) may be 
appropriate. While these formulas give the 
correlation that would be obtained if two or
more groups were mingled, they do not in any
way justify such mingling. It is well 
known that if variables X and Y are uncorre- 
lated in each of two groups, they will exhibit 
correlation when the records are mingled if the 
mean of the X’s in the second group differs 
from the mean of the X’s in the first group 
and/or if the mean of the Y’s in the second 
group differs from the mean of the Y’s in the 
first group. The greater the differences be- 
tween means, the greater the increase in the 
correlation. Thus combining widely different 
groups, such as a third grade and an eighth 
grade, or a group of literate children and a 
group of illiterate children, or a group of ele- 
mentary school children and a group of col- 
lege students, gives rise to some considerable 
amount of correlation due to mere increase of 
range. The effect of wide grade-ranges on 
the reliability coefficient is generally known, 
and the authors of most modern tests report 
reliability for single grade-ranges. Kelley* 
has provided a formula for estimating the re- 
liability in one range, knowing the reliability 
coefficient in a second range and the standard 
deviations in both ranges. 
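The effect of mingling unlike groups can be shown numerically with formula (13). The sketch below assumes, purely for illustration, two groups with zero correlation within each group, equal standard deviations, and means that differ on both variables; the function name and parameters are hypothetical.

```python
import math

def mingled_r(m, n, r_m, r_n, sd_x, sd_y, gap_x, gap_y):
    """Formula (13) for two groups with common SDs and means gap_x, gap_y apart."""
    total = m + n
    # Deviations of each subgroup mean from the mean of the total group.
    dxm, dxn = -n * gap_x / total, m * gap_x / total
    dym, dyn = -n * gap_y / total, m * gap_y / total
    num = (m * (sd_x * sd_y * r_m + dxm * dym)
           + n * (sd_x * sd_y * r_n + dxn * dyn))
    var_x = m * (sd_x ** 2 + dxm ** 2) + n * (sd_x ** 2 + dxn ** 2)
    var_y = m * (sd_y ** 2 + dym ** 2) + n * (sd_y ** 2 + dyn ** 2)
    return num / math.sqrt(var_x * var_y)

same_means = mingled_r(50, 50, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0)
two_sd_gap = mingled_r(50, 50, 0.0, 0.0, 1.0, 1.0, 2.0, 2.0)
four_sd_gap = mingled_r(50, 50, 0.0, 0.0, 1.0, 1.0, 4.0, 4.0)
```

Under these assumptions the mingled correlation is zero when the means coincide and rises as the gap between the group means widens, although neither group shows any correlation by itself.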

In applying the above formulas, therefore, 
the investigator should assure himself in ad- 
vance that the data come from the same 
grade-ranges, the same age-ranges, or some 
other homogeneous universe. The formulas 
can of course be applied to any set of data, 
and will give the correct arithmetical result, 
but the interpretation of the result when ob- 
tained must be precisely that which would be 
made if the original records had been mingled 
before the correlation was computed. 

Fairly frequent requests for an equation for 
combining correlations, together with some 
consideration of the number of published stud- 
ies wherein correlations are simply averaged 
(formula (16)), indicated that if this formula 
((13) or (14)) is in print it is not generally 
available or known. A fairly extensive sur- 
vey of textbooks and of the literature sub- 
stantiated this belief. It seems incredible 
that such a useful formula has not already appeared,
and the only reason for the publication
of this paper is to make it generally available. 


* Kelley, T. L., “The reliability of test scores”. J. Educa- 
tional Research, 1921, 3, 370-379. 







A NOTE ON THE COMPUTATION OF THE CORRELATION 
BETWEEN TWO LENGTHENED TESTS 


HAROLD A. EDGERTON
Ohio State University 


The tables published by Edgerton and 
Toops in the Journal of Educational Re- 
search* may be used to solve the formula for 
estimating the correlation between m forms of 
test X and n forms of test Y, when the corre- 
lation between X and Y and the reliabilities 
of the two variables are known. 


The formula is

\[ r_{(mX)(nY)} = \frac{mn\,\bar{r}_{xy}}{\sqrt{m + m(m-1)\bar{r}_{xx}}\;\sqrt{n + n(n-1)\bar{r}_{yy}}}, \]

where $\bar{r}_{xy}$ is the average of the correlations of X with Y, and
$\bar{r}_{xx}$ and $\bar{r}_{yy}$ are the average intercorrelations of
the $m$ forms of test X and the $n$ forms of test Y, respectively.

In actual practice the best estimate of $\bar{r}_{xy}$ is taken to be
the correlation of X and Y, and the best estimates of $\bar{r}_{xx}$ and
$\bar{r}_{yy}$ are taken to be the respective reliability coefficients.


The formula given may be rearranged:

\[ r_{(mX)(nY)} = r_{xy} \sqrt{\frac{m}{1 + (m-1) r_{xx}}}\; \sqrt{\frac{n}{1 + (n-1) r_{yy}}}. \]

The two radical factors are the quantities "K" given in the table by
Edgerton and Toops.


* Edgerton, Harold A. and Toops, Herbert A.: "A Table for Predicting the
Validity and Reliability Coefficients of a Test when Lengthened."
Journal of Educational Research, Vol. 18, pp. 225-234 (1928).


Using the notation of the table,

\[ r_{(mX)(nY)} = r_{xy} K_x K_y, \]

where $K_x$ and $K_y$ are the tabled values which are functions of $m$
and $r_{xx}$, and of $n$ and $r_{yy}$, respectively. Thus the expected
correlation of test X, when lengthened to four times its present length
($m = 4$), with test Y, lengthened to twice its present length
($n = 2$), when $r_{xy} = .40$, $r_{xx} = .65$ and $r_{yy} = .73$ is,
according to the rearranged formula,

\[ r_{(4X)(2Y)} = .40 \sqrt{\frac{4}{1 + 3(.65)}}\; \sqrt{\frac{2}{1 + 1(.73)}}, \]

which becomes

\[ (0.40)\sqrt{4/2.95}\,\sqrt{2/1.73} = (0.40)\sqrt{1.3559}\,\sqrt{1.1561} = (0.40)(1.164)(1.075) = .5005. \]

Using the table, $K_x$ and $K_y$ are found to be 1.164 and 1.075
respectively. Hence,

\[ r_{(4X)(2Y)} = (0.40)(1.164)(1.075) = .5005. \]


Thus the solution of the formula is a mat- 
ter of looking up two tabled values and ob- 
taining the product of three numbers, in all 
much easier than the full computation indi- 
cated. 
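For readers without the table at hand, the two K factors and the lengthened correlation can be computed directly. This is a minimal sketch with hypothetical function names; it evaluates the rearranged formula and, as a check, the original form, using the figures of the worked example.

```python
import math

def k_factor(times, reliability):
    """K = sqrt(times / (1 + (times - 1) * reliability))."""
    return math.sqrt(times / (1 + (times - 1) * reliability))

def lengthened_correlation(r_xy, m, r_xx, n, r_yy):
    """Rearranged formula: r_xy times the two K factors."""
    return r_xy * k_factor(m, r_xx) * k_factor(n, r_yy)

def lengthened_correlation_full(r_xy, m, r_xx, n, r_yy):
    """Original form, with the average correlations replaced by their estimates."""
    return (m * n * r_xy
            / (math.sqrt(m + m * (m - 1) * r_xx)
               * math.sqrt(n + n * (n - 1) * r_yy)))

k_x = k_factor(4, 0.65)
k_y = k_factor(2, 0.73)
r_long = lengthened_correlation(0.40, 4, 0.65, 2, 0.73)
r_full = lengthened_correlation_full(0.40, 4, 0.65, 2, 0.73)
```

Carried to full precision the product is about .5008; the figure .5005 in the text results from rounding each K to three decimals before multiplying.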







A NOMOGRAPH FOR ESTIMATING THE VALIDITY 
COEFFICIENT OF A LENGTHENED TEST 


HAROLD A. VOSS
Fordham University 


Knowing the validity and reliability of a test, it is possible to
estimate the increase in validity due to lengthening the test by means
of a variation of the Spearman-Brown Prophecy formula. Although the
formula has been reported in a number of forms, all of these are
algebraically equivalent to the following:

\[ V_n = \frac{V_{1c}}{\sqrt{\dfrac{1 - r_{11}}{N} + r_{11}}} \]

where

$V_n$ = the estimated validity coefficient.

$V_{1c}$ = the correlation between the test and the criterion, or the
average correlation between more than one form of the test and the
criterion.

$r_{11}$ = the reliability coefficient or the average reliability
coefficient.

$N$ = the number of times the test is lengthened or the number of tests
added.


Given the values of any three of the vari- 
ables, the value of the fourth can be deter- 
mined from the accompanying nomograph. 


1. To solve for $V_n$, given $r_{11} = .83$, $V_{1c} = .60$, $N = 3$.

   a) Connect with an isopleth or ruler the values of $r_{11} = .83$
      and $N = 3$ on the proper scales. The hairline on the isopleth
      cuts the X scale at 2.3.

   b) Pivot the isopleth at the point 2.3 on the X scale until it meets
      the proper value, .60, on the $V_{1c}$ scale. The isopleth
      intersects the $V_n$ scale at .637.

2. To solve for $V_{1c}$, given $r_{11} = .83$, $V_n = .637$, $N = 3$.

   a) Repeat step 1 a).

   b) Pivot the straightedge where it cuts the X scale until it meets
      the proper value, .637, on scale $V_n$. Read the value .60 on the
      $V_{1c}$ scale.

3. To solve for $N$, given $r_{11} = .83$, $V_{1c} = .60$, $V_n = .637$.

   a) Connect the value, .60, on scale $V_{1c}$ and .637 on scale $V_n$.

   b) Pivot the isopleth where it cuts scale X until it meets the
      value, .83, on scale $r_{11}$. The value, 3, is then read where
      the isopleth cuts the $N$ scale.

4. To solve for $r_{11}$, given $V_n = .637$, $V_{1c} = .60$, $N = 3$.

   a) Repeat step 3 a).

   b) Pivot the isopleth where it cuts scale X until it meets the
      value, 3, on scale $N$. The value .83 is then read at the point
      where the $r_{11}$ scale is cut.
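The four nomograph readings can also be reproduced arithmetically by solving the formula for each variable in turn. The sketch below is illustrative: the function names are hypothetical, and the inverse forms are obtained here by elementary algebra from $(1 - r_{11})/N + r_{11} = (V_{1c}/V_n)^2$ rather than taken from the article.

```python
import math

def lengthened_validity(v_1c, r_11, n):
    """Estimated validity of the test lengthened N times."""
    return v_1c / math.sqrt((1 - r_11) / n + r_11)

def original_validity(v_n, r_11, n):
    """Validity of the unlengthened test, given the lengthened validity."""
    return v_n * math.sqrt((1 - r_11) / n + r_11)

def lengthening_factor(v_1c, v_n, r_11):
    """N such that the lengthened test reaches validity v_n."""
    return (1 - r_11) / ((v_1c / v_n) ** 2 - r_11)

def reliability(v_1c, v_n, n):
    """Reliability implied by the two validities and N (N must exceed 1)."""
    q = (v_1c / v_n) ** 2
    return (n * q - 1) / (n - 1)

v_n = lengthened_validity(0.60, 0.83, 3)
v_1c = original_validity(0.637, 0.83, 3)
n_times = lengthening_factor(0.60, 0.637, 0.83)
r_11 = reliability(0.60, 0.637, 3)
```

With the figures of the worked examples these return values close to .637, .60, 3, and .83; the small departures for the last two arise because .637 is itself a rounded reading.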


REFERENCES 


Holzinger, K. J., Statistical Methods for Stu- 
dents in Education. Boston, Ginn and 
Company, 1928, p. 170. 

Hull, C. L., “The Joint Yield of Teams of 
Tests.” Journal of Educational Psychol- 
ogy, XIV (1923), 396-406, formula 1. 

Kelley, T. L., Statistical Method. New York, 
Macmillan Co., 1923, p. 200, formula 152. 

Lindquist, E. F. and Cook, W. W., “Experi- 
mental Procedures in Test Evaluation.” 
Journal of Experimental Education, I
(1933), 163-185, p. 166.

Odell, C. W., Statistical Method in Education. 
New York, Appleton—Century, 1935, pp. 
212-214, p. 213. 


[Figure: nomograph for estimating the validity coefficient of a
lengthened test, with scales for the estimated validity, the original
validity, the auxiliary X scale, N, and the reliability.]

A CHART TO FACILITATE THE COMPUTATION OF THE 
CORRELATION RATIOS 


SYDNEY ROSLOW 


Robert Louis Stevenson School 
New York City 


This chart is designed to aid the research 
worker in psychology and education in the 
computation of correlation ratios with the 
aid of any of the usual calculating machines. 
It often happens that a regression is curvi- 
linear, but that because the procedure appears 
to be complicated, the eta coefficients are 
never obtained. In such an event, the Pear- 
son r does not present a true evaluation of the 
relationship. Because of this situation, the 
present chart* was evolved to systematize the 
computation of the eta coefficients. The sta- 
tistical steps, simplified and arranged in a 
methodical order, are outlined below. 

The chart makes provision for plotting the data in the usual way. The
uppermost row and extreme left column are for entering the class or step
intervals ($CI_x$ and $CI_y$). In the small cells, $I_x$ and $I_y$, are
entered the sizes of the X and Y intervals. The $F_x$ and $F_y$ entries
are made in the usual way by adding the cases falling in each cell down
the columns and across the rows respectively.

$\sum X$ and $\sum X^2$ may be obtained on the calculating machine in
one operation. $F_x$ multiplied by the corresponding X-deviation and its
square yields the $FX$ and $FX^2$. This is done by dividing the keyboard
into two sections: the five columns on the right for $X$, and the
remaining columns on the left for $X^2$. Each $X$ and $X^2$ are
respectively placed in the keyboard and multiplied by the corresponding
$F_x$. The results of these multiplications are accumulated in the
carriage dials, and the final result will be $\sum X$ on the right
portion of the dial, and $\sum X^2$ on the left. A similar operation
with the frequencies in Y yields the $\sum Y$ and $\sum Y^2$. These
results are entered in their respective cells at the right of the chart,
labeled $\sum X$, $\sum Y$, $\sum X^2$, and $\sum Y^2$.

The next step is to obtain the $\sum F_y X$, the sum of the frequencies
in Y times the X-deviations, and $\sum F_x Y$, the sum of the
frequencies in X times the Y-deviations, which are used in subsequent
computations and also to check the $\sum X$ and $\sum Y$ computed on the
machine. The entries in the $\sum F_y X$ column are computed by working
in each row from left to right. The frequency in each cell is multiplied
by the X-deviation. The products are accumulated, and the final sum for
each row is entered in the column $\sum F_y X$. The sum of this column
yields the $\sum X$ and thus checks the $\sum X$ previously obtained.
The entries in the $\sum F_x Y$ row are similarly computed by working in
each column from top to bottom. The frequency in each cell is multiplied
by the Y-deviation. The products are accumulated, and the final sum for
each column is entered in the $\sum F_x Y$ row. This row, when added,
gives the $\sum Y$, which checks the $\sum Y$ obtained on the machine.

* ... purchased from The Psychological Corporation, 522 Fifth Avenue,
New York City.

Each figure in the $\sum F_y X$ column is squared. A table of squares,
such as Barlow's, is useful in this process. These squares are entered
in the column $(\sum F_y X)^2$. Each square is then divided by its
respective $F_y$, and the quotient entered in the column
$(\sum F_y X)^2 / F_y$. This column is then summed. Similarly each
figure in the $\sum F_x Y$ row is squared, and the squares recorded in
the $(\sum F_x Y)^2$ row. Then the squares are divided by their
respective $F_x$'s. The quotients are recorded in the
$(\sum F_x Y)^2 / F_x$ row and the sum is obtained.

The next step is to obtain the XY products which are placed in the last
column at the right and the last row at the bottom. To do this, each
entry in the $\sum F_y X$ column is multiplied by the Y-deviation. This
product is then recorded in the $\sum F_y XY$ column. The sum of this
column is the $\sum XY$. In the same manner, each figure in the
$\sum F_x Y$ row is multiplied by the X-deviation to yield the products
to be placed in the cells in the $\sum F_x XY$ row. The sum of this row
is again the $\sum XY$ and should check the previous $\sum XY$.

These computations give the basic figures to be utilized in the
subsequent work of calculating the coefficients. At this point, the
$\sum X$, $\sum Y$, and $\sum XY$ have been automatically checked. The
$\sum X^2$ and $\sum Y^2$ may be checked by recomputing or by using the
method of Charlier's check. The squares which are entered in the
$(\sum F_y X)^2$ column and $(\sum F_x Y)^2$ row have been obtained from
a table of squares. These figures can then be checked by computing on
the machine. If the quotients which are recorded in the
$(\sum F_y X)^2 / F_y$ column and $(\sum F_x Y)^2 / F_x$ row are
recomputed, all the basic preliminary computations will have been
checked.

The two $\eta$'s and the $r$ are computed according to the formulae
given on the bottom of the chart. The next step is to obtain
$(\sum X)^2$, $(\sum Y)^2$, $N\sum X^2$, $N\sum Y^2$,
$N\sum \frac{(\sum F_x Y)^2}{F_x}$ and $N\sum \frac{(\sum F_y X)^2}{F_y}$.
These figures are entered in their respective cells. Following these
computations, the four radical quantities are calculated,

\[ \sqrt{N\sum X^2 - (\sum X)^2}, \quad \sqrt{N\sum Y^2 - (\sum Y)^2}, \]
\[ \sqrt{N\sum \frac{(\sum F_y X)^2}{F_y} - (\sum X)^2}, \quad
   \sqrt{N\sum \frac{(\sum F_x Y)^2}{F_x} - (\sum Y)^2}, \]

and the square roots extracted. These computations are performed and the
square roots obtained because each figure is used more than once. These
square roots are used in computing the correlation ratios, the $r$, and
the standard deviations. Having computed these quantities once they need
not be found again.

The chart also provides for obtaining the two standard deviations and
the two means. The mean of X is found, for example, by dividing the
$\sum X$ by $N$. The result is given in terms of the class interval in
which the mean falls. The actual value of the mean must be determined by
multiplying this quotient by $I_x$ and adding this result to the
mid-point of the 0 step-interval.

A set of data is plotted on the chart and the correlation ratios and the
Pearson $r$ are computed.
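The sequence of chart operations can be mirrored in a short routine that works from a frequency table and the class deviations. This is a sketch under stated assumptions: the function name, argument layout, and the tiny illustrative table are hypothetical, and the labels "Y on X" and "X on Y" are used in place of subscripted etas because the chart's own subscript convention is not legible here.

```python
import math

def correlation_ratios(table, x_dev, y_dev):
    """table[i][j] is the frequency in Y-class i (row) and X-class j (column)."""
    n = sum(sum(row) for row in table)
    columns = list(zip(*table))
    f_y = [sum(row) for row in table]           # row totals
    f_x = [sum(col) for col in columns]         # column totals
    sum_x = sum(f * x for f, x in zip(f_x, x_dev))
    sum_y = sum(f * y for f, y in zip(f_y, y_dev))
    sum_x2 = sum(f * x * x for f, x in zip(f_x, x_dev))
    sum_y2 = sum(f * y * y for f, y in zip(f_y, y_dev))
    # Sum of F*X within each row, and sum of F*Y within each column.
    row_fx = [sum(f * x for f, x in zip(row, x_dev)) for row in table]
    col_fy = [sum(f * y for f, y in zip(col, y_dev)) for col in columns]
    sum_xy = sum(s * y for s, y in zip(row_fx, y_dev))
    q_rows = sum(s * s / f for s, f in zip(row_fx, f_y) if f)
    q_cols = sum(s * s / f for s, f in zip(col_fy, f_x) if f)
    rad_x = math.sqrt(n * sum_x2 - sum_x ** 2)
    rad_y = math.sqrt(n * sum_y2 - sum_y ** 2)
    eta_x_on_y = math.sqrt(max(0.0, n * q_rows - sum_x ** 2)) / rad_x
    eta_y_on_x = math.sqrt(max(0.0, n * q_cols - sum_y ** 2)) / rad_y
    r = (n * sum_xy - sum_x * sum_y) / (rad_x * rad_y)
    return eta_y_on_x, eta_x_on_y, r

# Y is an exact curvilinear function of X in this made-up table.
table = [[0, 5, 0],
         [5, 0, 5]]
eta_y_on_x, eta_x_on_y, r = correlation_ratios(table, x_dev=[-1, 0, 1], y_dev=[0, 1])
```

In this deliberately extreme table the Pearson $r$ is zero although Y is completely determined by X, which is the situation the eta coefficient is meant to expose.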


[Figure: CORRELATION RATIO CHART. A worked example is plotted with the
X variable across the top and the Y variable down the side; the marginal
columns and rows carry the frequencies, the $\sum F_y X$,
$(\sum F_y X)^2$, and $\sum F_y XY$ entries and their counterparts for
the columns, and the formulae for the two correlation ratios, the
Pearson $r$, and the standard deviations are printed at the bottom.]


REGRESSION AND STANDARD ERROR CALCULATION 
FROM THE NORMAL EQUATIONS 


ALBERT B. BLANKENSHIP 
University of Oregon 


In a recent article, Bronfin and Newhall 
derive lengthy, involved formulae for the re- 
gression equation, regression coefficient, stand- 
ard error of the regression coefficient, and 
standard error of estimate, all in terms of raw 
scores without the use of the correlation coef- 
ficient. Bronfin and Newhall give the equa- 
tions in their final form as: 


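\[ b_{yx} = \frac{N\sum XY - \sum X \cdot \sum Y}{N\sum X^2 - (\sum X)^2} \]

(Regression coefficient)

\[ \sigma_{b_{yx}} = \big[ \{N\sum X^2 - (\sum X)^2\}\{N\sum Y^2 - (\sum Y)^2\} - \{N\sum XY - (\sum X)(\sum Y)\}^2 \big]^{1/2}
 \div \big[ N^{1/2} \{N\sum X^2 - (\sum X)^2\} \big] \]

(Standard error of regression coefficient)

\[ Y = \frac{N\sum XY - \sum X \cdot \sum Y}{N\sum X^2 - (\sum X)^2}\,(X - M_x) + M_y \]

(Regression equation)

\[ \sigma_{y \cdot x} = \big[ \{N\sum X^2 - (\sum X)^2\}\{N\sum Y^2 - (\sum Y)^2\} - \{N\sum XY - (\sum X)(\sum Y)\}^2 \big]^{1/2}
 \div \big[ N^2 \{N\sum X^2 - (\sum X)^2\} \big]^{1/2} \]

(Standard error of estimate)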


These authors started with the well-known equations for the above in
terms of reduced scores, in each case substituting for $r$, $\sigma_x$,
and $\sigma_y$ in terms of raw scores. They may be criticized for their
method of derivation of the equations. It would have been more logical
and simpler to start with the equation for a straight line:

\[ Y = a + bX. \]

This is the generalized regression equation, and the problem, as will be
indicated later, is to find $a$ and $b$. Starting with the above
equation, according to the theory of least squares, we are to minimize
$\sum (Y - \hat{Y})^2$, where $\hat{Y} = a + bX$ denotes the value given
by the line. Let $f = \sum (Y - \hat{Y})^2 = \sum (Y - a - bX)^2$.
Expanding the equation, it is found that

\[ f = \sum Y^2 + Na^2 + b^2 \sum X^2 - 2a\sum Y - 2b\sum XY + 2ab\sum X. \]

This expression is differentiated first with respect to $a$ and then to
$b$, and the derivatives set equal to zero.

\[ \frac{\partial f}{\partial a} = 2Na - 2\sum Y + 2b\sum X = 0. \]

\[ \frac{\partial f}{\partial b} = 2b\sum X^2 - 2\sum XY + 2a\sum X = 0. \]

Dividing each of these equations by 2, and transposing, we have the two
normal equations for fitting a straight line by the method of least
squares:

\[ \sum Y = Na + b\sum X. \]
\[ \sum XY = a\sum X + b\sum X^2. \]

¹ H. Bronfin and S. M. Newhall, "Regression and standard error
calculation without the correlation coefficient," J. Ed. Psychol., 1934,
25, 634-636.


Solving the first equation for $b$, we find that

$$b = \frac{\Sigma Y - Na}{\Sigma X}$$

Solving, alternatively, for $Na$, and multiplying
both sides by $\Sigma X$,

$$Na\Sigma X = \Sigma X \cdot \Sigma Y - b(\Sigma X)^2$$

Multiplying both sides of the second equation
by $N$, and substituting for $Na\Sigma X$ the value
obtained above,

$$N\Sigma XY = \Sigma X \cdot \Sigma Y - b(\Sigma X)^2 + Nb\Sigma X^2$$

Solving for $b$ gives the same equation that
Bronfin and Newhall list for the regression
coefficient.

Solving the first normal equation for $a$, and
noting that $\Sigma Y/N = M_y$, and $\Sigma X/N = M_x$,
we find that $a = M_y - bM_x$. Then substituting
in the equation, $Y = a + bX$, we find
that

$$Y = b(X - M_x) + M_y.$$
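The solution of the two normal equations can be sketched in a few lines of Python. This is only an illustration, not part of the original article: the function name `fit_line_from_sums` and the four-point data set are assumptions made here, and the data were chosen to lie exactly on a known line so the result is easy to check by hand.

```python
def fit_line_from_sums(n, sum_x, sum_y, sum_x2, sum_xy):
    """Solve the two normal equations for intercept a and slope b.

        sum_y  = n*a     + b*sum_x
        sum_xy = a*sum_x + b*sum_x2
    """
    denom = n * sum_x2 - sum_x ** 2            # N*SX2 - (SX)^2
    if denom == 0:
        raise ValueError("all X values are identical; the slope is undefined")
    b = (n * sum_xy - sum_x * sum_y) / denom   # raw-score regression coefficient
    a = sum_y / n - b * sum_x / n              # a = M_y - b*M_x
    return a, b


# Hypothetical data lying exactly on Y = 1 + 2X.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line_from_sums(
    len(xs),
    sum(xs),
    sum(ys),
    sum(x * x for x in xs),
    sum(x * y for x, y in zip(xs, ys)),
)
print(a, b)
```

Only the five sums enter the computation, which is the point of the raw-score approach: no deviations from the means and no correlation coefficient need be formed.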




Substituting for b in this equation the value 
previously derived, we obtain the same form 
of the regression equation derived by Bron- 
fin and Newhall. 

A common² form of the standard error of
estimate, derived from the normal equations,
follows:

$$\sigma_{y \cdot x} = \sqrt{\frac{\Sigma Y^2 - a\Sigma Y - b\Sigma XY}{N}}.$$








Solving the normal equations for $a$ and $b$ and
substituting in this equation, we obtain for
the standard error of estimate the same formula
listed by Bronfin and Newhall.


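It is simple to derive the formula for the
standard error of the regression coefficient
from this. We know that

$$\sigma_{b_{yx}} = \frac{\sigma_y\sqrt{1 - r^2}}{\sigma_x\sqrt{N}},$$

and that

$$\sigma_y\sqrt{1 - r^2} = \sigma_{y \cdot x}.$$

Therefore

$$\sigma_{b_{yx}} = \frac{\sigma_{y \cdot x}}{\sigma_x\sqrt{N}}.$$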


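If the derived value, given above, be substituted
for the standard error of estimate,

$$\sigma_{b_{yx}} = \sqrt{\frac{\Sigma Y^2 - a\Sigma Y - b\Sigma XY}{N^2\sigma_x^2}} = \sqrt{\frac{\Sigma Y^2 - a\Sigma Y - b\Sigma XY}{N\Sigma X^2 - (\Sigma X)^2}}.$$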


If Bronfin and Newhall's formula for the
standard error of estimate be substituted in
the formula for $\sigma_{b_{yx}}$, together with the gross-score
value of $N^2\sigma_x^2$, their formula for the
standard error of the regression coefficient will
be obtained.
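The claimed agreement between the short form of the standard error of estimate and the raw-score form can be checked numerically. The sketch below is an illustration added here, not the author's own computation: the function names and the five-point data set are hypothetical, and the short form assumes that the sum of X is not zero.

```python
import math


def se_estimate_short(n, sx, sy, sx2, sy2, sxy):
    """Standard error of estimate using a and b from the normal equations."""
    denom = n * sx2 - sx ** 2
    a = (sx2 * sy - sx * sxy) / denom
    b = (sy - n * a) / sx                      # assumes sum of X is non-zero
    return math.sqrt((sy2 - a * sy - b * sxy) / n)


def se_estimate_raw(n, sx, sy, sx2, sy2, sxy):
    """Standard error of estimate written entirely in raw-score sums."""
    A = n * sx2 - sx ** 2                      # N*SX2 - (SX)^2
    B = n * sy2 - sy ** 2                      # N*SY2 - (SY)^2
    C = n * sxy - sx * sy                      # N*SXY - SX*SY
    return math.sqrt(A * B - C ** 2) / math.sqrt(n ** 2 * A)


# Hypothetical, nearly linear data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
sums = (
    len(xs),
    sum(xs),
    sum(ys),
    sum(x * x for x in xs),
    sum(y * y for y in ys),
    sum(x * y for x, y in zip(xs, ys)),
)
short_form = se_estimate_short(*sums)
raw_form = se_estimate_raw(*sums)
print(short_form, raw_form)
```

The two results should differ only by floating-point rounding, since both are algebraic rearrangements of the same residual sum of squares divided by N.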


Four of the formulas here derived are quite
useful, and are much easier to use than those
suggested by Bronfin and Newhall. The four
formulas are:

$$Y = a + bX.$$

(Regression equation)

$$b = \frac{\Sigma Y - Na}{\Sigma X}$$

(Regression coefficient)

$$\sigma_{y \cdot x} = \sqrt{\frac{\Sigma Y^2 - a\Sigma Y - b\Sigma XY}{N}}$$

(Standard error of estimate)

$$\sigma_{b_{yx}} = \sqrt{\frac{\Sigma Y^2 - a\Sigma Y - b\Sigma XY}{N\Sigma X^2 - (\Sigma X)^2}}$$

(Standard error of regression coefficient)

² For derivation of this formula, the reader is referred to
F. C. Mills, Statistical Methods, New York, 1924,
p. 377.


The normal equations must first be solved
for $a$.

$$a = \frac{\Sigma X^2 \cdot \Sigma Y - \Sigma X \cdot \Sigma XY}{N\Sigma X^2 - (\Sigma X)^2}.$$


Where $\Sigma X = 274$, $\Sigma Y = 215$, $N = 50$,
$\Sigma X^2 = 1700$, $\Sigma Y^2 = 1145$, and $\Sigma XY = 1341$,
it is readily found that $a = -.1949$. By a multiplication,
a subtraction and a division, $b$ is
found to be .8202. Using Bronfin and Newhall's
equation, we must perform four multiplications,
two subtractions, and one division, all
using fairly large figures.

An estimated value of $Y$ is quite easily
found. Since $a$ and $b$ are already known, it
takes but a short time to substitute in the
above regression equation. Where $X$ is 5,
$Y = 3.906$. Bronfin and Newhall's formula
for the regression equation requires five multiplications,
three divisions, three subtractions,
and one addition, using large numbers.

To find the standard error of estimate, it 
requires but two multiplications, two subtrac- 
tions, and one division. Bronfin and New- 
hall’s equation requires ten multiplications, 
four subtractions, and one division. There is 
little reason to spend much time to find the 
figure, in this case 1.319. 

The fourth formula, above, is used to find 
the standard error of the regression coeffi- 
cient only if the standard error of estimate has 
not been calculated. The formula given by 
Bronfin and Newhall is used only in similar 
circumstances. If the standard error of esti- 
mate has been computed by either formula, 
the standard error of the regression coefficient 
is obtained from the formula: 





$$\sigma_{b_{yx}} = \frac{\sigma_{y \cdot x}}{\sqrt{\Sigma X^2 - (\Sigma X)^2/N}}$$

Its value in this example is .0936.
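The arithmetic of the worked example can be retraced with a short Python sketch. It is offered only as a check on the figures, using the sums quoted in the text; the variable names are assumptions made here. From these sums the intercept computes to roughly -0.195, that is, a negative value.

```python
import math

# Sums quoted in the article's example.
N, SX, SY = 50, 274, 215
SX2, SY2, SXY = 1700, 1145, 1341

denom = N * SX2 - SX ** 2                          # N*SX2 - (SX)^2
a = (SX2 * SY - SX * SXY) / denom                  # intercept from the normal equations
b = (SY - N * a) / SX                              # slope from the first normal equation
y_hat_5 = a + b * 5                                # estimated Y where X = 5
se_est = math.sqrt((SY2 - a * SY - b * SXY) / N)   # standard error of estimate
se_b = se_est / math.sqrt(SX2 - SX ** 2 / N)       # standard error of the slope

print(round(a, 4), round(b, 4), round(y_hat_5, 3), round(se_est, 3), round(se_b, 4))
```

To the number of places quoted, the slope, the estimate at X = 5, and the two standard errors agree with the values .8202, 3.906, 1.319, and .0936 given in the text.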


(The writer wishes to express his indebtedness
to Prof. C. L. Huffacker, of the University
of Oregon, for aid in preparation of this
article.)


HOW SCIENCE MEASURES 


DOUGLAS E. SCATES


Director of School Research 
Cincinnati Public Schools 


If one examines the various measurement 
procedures in a number of fields of science he 
will find the most elaborate and ingenious de- 
vices for indexing changes in variables, which 
would seem to the uninitiated quite beyond 
the realm of possibility. Modern physical 
science reaches out to the stars, weighs them, 
analyzes their constituents, determines their 
temperatures, calculates the length of time 
it takes for their light to reach the earth, and 
notes the rate at which the celestial universe 
is expanding. And in the other direction,
science divides matter into units far smaller 
than can be seen through the strongest micro- 
scope, and then proceeds to reveal still smaller 
elements within these units— elements so 
small that the orbits through which they move, 
within the confines of the atom or molecule, 
represent comparatively great reaches of 
space. The differences between one kind of 
matter and another are accounted for in 
terms of the number and structural relation- 
ships of the four kinds of elements which make 
up these minute universes. Similar examples 
of great penetration are to be found in other 
branches of science. 

It is the story of research at large to de- 
scribe how devices for accomplishing these 
measurements have been worked out; 
it is our purpose here to examine the 
measurement procedures which are being used 
in various fields of science, and attempt to 
abstract their essential characteristics. By 
this means we may not only secure a picture 
of the many ways in which measurement is 
now being performed, but we may draw from 
these practices certain general conclusions 
about the nature of measurement. Many of 
the illustrations used are taken from the phy- 
sical sciences because of the high regard in 
which these fields are commonly held. The 
concepts of measurement found in the physical 
sciences may however require some extension 
to cover certain forms of measurement that 
are appropriate to the biological and social 
sciences. No science should lean so heavily 
upon the conventions of another science that 


it neglects to develop procedures appropriate 
to its own problems; and in establishing its 
own procedures it will necessarily modify to 
some extent the concepts that are common to 
the several sciences. 

It would appear that science measures by 
meeting three primary conditions. The first 
of these requirements is a working concept of 
a character that one desires to measure. The 
adoption of the proper concept is of the great- 
est importance in determining the character 
and success of one’s work. Binet, for exam- 
ple, struggled for years to discover the idea of 
what he ought to be trying to measure. Often 
the fortunate choice of a concept of the proper 
character to measure marks the turning point 
in the development of a field of work. We 
shall not here however discuss the appropriate- 
ness of concepts for a given piece of research; 
we are interested rather in noting the general
characteristics of a measurement concept for 
work in any field. 

It appears that the working concept of the 
character to be measured has three important 
characteristics. In the first place, it should 
be a relatively pure abstraction. One has 
merely to consider distance, for example, to 
recognize what a complete abstraction the 
concept is. It is so general that it can be ap- 
plied to anything for which a physical man- 
ifestation is possible; it is so abstract that it 
is uninfluenced by the varying heterogeneity 
and structural complexity of the thing to 
which it is applied. This abstractedness en- 
ables us to throw the entire range of radiant 
energy onto a single scale, and compare elec- 
tro-magnetic waves, heat, visible light, x-rays, 
cosmic rays, and others, all on the basis of 
this single characteristic. Thus, in spite of 
the many differences in the nature and effect 
of these various waves, and in spite of the 
great range in their magnitude (from several 
hundred miles, down to .000,000,000,001 of an 
inch) this single characteristic remains the 
same throughout. 

We are so used to the concept of distance 
that we fail to appreciate the large amount 




of mental effort that must originally have 
been required to formulate it as a pure ab- 
straction, and then to generalize upon it for 
all situations, and all magnitudes, until it was 
recognized as the same character, regardless 
of the situation. There are probably at the 
present time a good many characters in most 
sciences that are regarded as separate traits 
because they occur in very different ranges of 
magnitude and have not been conceptually
connected into one common trait which dif- 
fers only in degree. 


The appropriate degree of abstraction limits 
concern, for the time being, to the particular 
character which is to be measured, permitting 
the phenomenon being observed to change 
within the range of measurement in many 
ways. The character to be measured is one 
which we abstract from among all other char- 
acteristics of the object, and, however much 
these other characteristics may be associated 
with it, and vary from one degree of magni- 
tude to another, we are concerned with only 
one character while we are measuring that one. 
As one writer has said, “We may count a 
man’s height in inches and assume that an 
inch at his feet is the same as an inch at his 
neck.”* This statement does not, of course,
mean anything so ridiculous as that an inch 
of feet is the same as an inch of neck; rather, 
it is designed to bring out the complete dis- 
sociation of the single character (height) from 
the other characters which are present but not 
being measured. The only requirement of 
other characters is that they shall not change 
beyond the limits permitted by the logical de- 
scription of the phenomenon. 


This entire divorcement of a single property 
from all other characteristics means that it be- 
comes necessary almost immediately to supple- 
ment measurements of the single character 
with other information, for we are seldom, if 
ever, interested solely in such measurements 
by themselves. The more complex and vari- 
able the objects which are being studied, the 
more information is required to supplement 
single measurements. This supplementary in- 
formation may be obtained either through 
measurements of other aspects of the phenom- 
enon, or through intimate acquaintance and 
detailed description of it. 


* Clifford Kirkpatrick, “Statistical Studies of Personality and
Personality Maladjustment,” Chapter XII, 197-216, in
Statistics in Social Studies, edited by Stuart A. Rice for the
Committee on Social Statistics of the American Statistical
Association. Philadelphia: The University of Pennsylvania
Press, 1930.




The second desirable characteristic of our 
working concept of an aspect to be measured 
is that it shall be satisfactorily constant or 
unitary. This condition grows immediately 
out of the first one, namely, its relatively 
complete abstractness. For example, length 
is the same wherever found. However obvi- 
ous this may seem, the point is not universally 
conceded. For instance, Bridgman holds that 
the working concept of the character to be 
measured grows out of the procedure by which 
the measurement is accomplished. He points 
out that measurement procedures necessarily 
vary for different ranges of magnitude (e.g., 
inches as contrasted with light years), and 
that the concept of the measured character 
accordingly varies widely. A quotation will 
serve to make his position explicit: 

“To find the length of an object, we have 
to perform certain physical operations. The 
concept of length is therefore fixed when the 
operations by which length is measured are 
fixed: that is, the concept of length involves 
as much as and nothing more than the set of 
operations by which length is determined. 

. . . The concept is synonymous with the
corresponding set of operations.” (p. 5) 


“If we deal with phenomena outside the domain
in which we originally defined our concepts,
we may find physical hindrances to performing
the operations of the original definition,
so that the original operations have to
be replaced by others. These new operations
are, of course, to be so chosen that they give,
within experimental error, the same numerical
results in the domain in which the two sets of
operations may be both applied; but we must
recognize in principle that in changing the
operations we have really changed the concept,
and that to use the same name for these different
concepts over the entire range is dictated
only by considerations of convenience,
which may sometimes prove to have been purchased
at too high a price in terms of
unambiguity.” (p. 23)†

The present writer cannot concede that our 
concept of a character is determined wholly 
and solely by the method used to index its 
magnitude. It is conceivable that, under cer- 
tain circumstances, the method of measure- 
ment used might be the only approach to the 
character measured, and the character could 
not be apprehended in any other way. If 


† P. W. Bridgman, The Logic of Modern Physics, p. 5 and
23. New York: The Macmillan Co., 1927. 228p. By
permission of The Macmillan Company, publishers.










such were the case, then we would be forced 
to the position taken by Bridgman; but we 
must recognize that such a situation arises 
from great poverty of knowledge, and is not 
an ideal, or even a likely situation. Nor- 
mally, there are a variety of ways of measur- 
ing any character, and the resulting general 
concept of this character would naturally be 
something more than the aggregation of dis- 
crete concepts growing out of the independent 
measurement procedures. Further, we usu- 
ally have had a variety of experiences with the 
concept — many of them being experiences 


with other aspects of the phenomenon being


measured. We thus build up a rich concept 
of the phenomenon and of the character or 
aspect being measured. Our notion of dis- 
tance, for example, carries at least traces of 
the unlimited number of experiences we have 
had with distance, and is not in any sense con- 
fined to the procedures by which we measure 
space. 

We may mention as a third desideratum for 
our concept of the character to be measured 
that it be well defined. This condition both 
grows out of, and reinforces, the preceding
two. Until a concept has become well estab- 
lished, we must expect that abstraction, mod- 
ification in the light of new experiences, and 
critical analysis will be taking place concur- 
rently. It is the notion of a properly defined 
concept that makes possible its consistency 
throughout all ranges, and which justifies its 
abstraction. For example, if we should find 
ultimately that space is to be regarded as 
curved throughout the great reaches of astronomical
observation, then we may define space
within the atom in the same terms. Again, re- 
fined observation might ultimately lead to the 
conviction that space is parabolic, the curva- 
ture becoming less (or more) in smaller 
ranges; if such should be the case, our defini- 
tion of space could include the parabolic ele- 
ment without compromising its consistency. 

We may, if we choose, define the character 
to be measured somewhat arbitrarily in ac- 
cordance with what we wish our concept of it 
to be. Referring to space again, no observa- 
tions could force us to give up our linear con- 
cept if we did not wish to, for any form of 
space would have linear aspects. We might 
wish to abandon any particular concept for 
one that was more useful if the first one gave 
rise to too great error or to too great complexity
in mathematical relationships, for certain
magnitudes. No one can say that, for certain 




purposes, any concept which is possible is not 
permissible. Reality has many dimensions, 
and the criterion for the selection of any par- 
ticular aspect is one’s purpose. Thus whereas 
the scientist may be interested in experimental 
consistency and calculational convenience, 
other persons may have other purposes that 
demand different concepts which, for their 
use, may be more appropriate than the scien- 
tist’s concepts. 

Definition is a developing process; it may 
be very crude to start with, and become re- 
fined as one explores further his purposes and 
his observations. For instance we all have a 
concept of weight that is satisfactory for 
everyday purposes; yet, if we examine the 
character, “weight”, under various conditions,
we encounter certain problems which force us 
to re-examine our concept and make certain 
decisions regarding it. For example, if we 
find that an object weighs less on a spring 
scale at the top of a mountain than at sea 
level, we have to decide whether we wish 
weight to be a character which varies with 
altitude. If we decide that weight will be 
evidenced by a balance, and find that some 
things do not change weight (on the balance) 
when put in a vacuum, whereas others do, we 
have a further decision to make regarding the 
character which we wish the concept “weight” 
to embrace. Again, when we wish to deter- 
mine the weight of the earth itself—which is 
dense, and yet floats in space—a re-definition 
of our concept in somewhat broader terms is 
called for. Similar problems arise in connec- 
tion with any concept, set up to satisfy cer- 
tain purposes, and used to guide in the selec- 
tion of aspects of reality. 

Up to this point, we have said nothing 
about one of the most difficult problems of 
measurement in the social sciences, namely, 
the adequate definition of a highly complex 
character. By complex we mean concep- 
tually divisible into a number of constituent 
variables. The basic concepts of physics— 
mass, time, and distance—are relatively sim- 
ple, even granting that they may be expressed, 
and to some extent explained, as functions of 
other variables. The physical sciences rap- 
idly built up combinations of these elemental 
concepts in order to derive complex charac- 
ters that are more immediately useful in vari- 
ous situations. But these normally yield to 
explicit analysis, some being, at least in part, 
built-up concepts. Their composition is 
therefore known. Take, for example, the 










complex character, “horsepower.” It repre- 
sents a somewhat arbitrary standard of 33,000 
foot-pounds per minute, being a combination 
of the three elemental concepts specified, plus 
a constant. Having expressed horsepower as 
a specific function of these simpler characters, 
it becomes possible to measure it accordingly. 
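The composition of such a built-up character can be written out directly. The sketch below is an illustration only: the function name and the sample figures are assumptions made here, and the force is taken in pounds, the distance in feet, and the time in minutes, so that the constant of 33,000 foot-pounds per minute applies.

```python
FOOT_POUNDS_PER_MINUTE_PER_HP = 33_000  # the conventional standard cited in the text


def horsepower(force_lb, distance_ft, time_min):
    """Express a rate of work as horsepower.

    Work is force times distance (foot-pounds); dividing by the elapsed
    time gives foot-pounds per minute, and dividing by the constant
    converts that rate to horsepower.
    """
    if time_min <= 0:
        raise ValueError("elapsed time must be positive")
    return force_lb * distance_ft / time_min / FOOT_POUNDS_PER_MINUTE_PER_HP


# Hypothetical case: 550 pounds raised 60 feet in one minute.
print(horsepower(550.0, 60.0, 1.0))
```

Because every constituent variable and the constant are explicit, two cases that differ in force, distance, and time can still be compared on the single derived scale.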

In the social sciences, however, we have not 
yet defined many of our general concepts so 
explicitly, and the measurement of them 
must, therefore, usually represent something 
of a sampling of the concept. As illustrations, 
we may refer to ability—in general, or any 
particular kind; to general intelligence, to 
“goodness,” quality, or value; or to such a 
concept as “size of a school system.” At first
thought, one may be inclined to say that he 
knows just what each of these concepts stands 
for; and, if he fails at any point, he has re- 
course to the dictionary. Such a belief is, of 
course, highly superficial; it may be possible 
to identify certain elements embraced within 
each of these categories, and it may be readily 
possible to distinguish clearly between two 
specimens (cases) that represent widely differ- 
ent degrees of magnitude with respect to any 
of these complex characters; but such condi- 
tions are not sufficient for accurate and re- 
fined measurement. What we need to know, 
for example, is how any two specimens stand 
on the general character when one of them 
stands slightly higher than the other with 
respect to a certain element, but slightly lower 
than the other in a second element. Or, in 
more general terms, how much variation in 
any one elemental variable is equivalent to a 
given amount of variation in any other elemen- 
tal variable or group of variables embraced by 
the general character? 

This problem of expressing a complex con- 
cept in terms of its constituent variables is 
not to be fully solved by determining fixed 
scales of equivalence between the different 
elements, nor by applying constant coefficients 
to them, for the equivalence of different ele- 
ments or groups of elements must be regarded 
as something that may vary from one part of 
the scale to another, and with each change in 
any one of the variables. We are wont to use 
such devices as linear (multiple) correlation 
and index numbers for approximating these 
complexes; we may continue to rely upon 
them in practice, but we must recognize them 
as makeshifts for a more faithful representa- 
tion which could be developed if we could ade- 
quately define the general character so that 




its accurate analysis in terms of compensating 
degrees of different elements could be effected. 

Thus far, we have been considering what 
was posited as the first requirement of meas- 
urement, namely, a concept of some character 
to be measured. Three desirable characteristics
of this concept have been referred to: it
should be a pure abstraction, uninfluenced by 
elements which are not specifically included in 
it; it should be constant for all ranges of 
magnitude, though this constancy does not 
necessarily exclude the condition of several 
sub-classes of the concept if such should prove 
desirable (any more than sub-classes in any 
other concept violate the unity of the general 
category); and, the concept should be well 
defined, so that it will afford a satisfactory 
working guide for application. In leaving this 
first point, it may be noted that the discussion 
has centered around a working concept more 
than around the character itself, because our 
concept of the character is all that we can be 
certain of; it is as close as we can approach 
to reality. 

As a second requisite for measurement, we 
should specify a satisfactory representation of
the character which we desire to measure, in 
the degree to which it exists in the phenomena 
under observation. That is, both the charac- 
ter and the objects which exhibit a certain de- 
gree of this character must be either immedi- 
ately or ultimately sensible (perceptible), so 
that they may be apprehended, if they are to 
be measured. This requirement means that, 
for those characters or objects which are not 
immediately sensible, we must be able to find 
objectifying functions if we are to measure 
them. 

Concerning the question as to whether we 
should seek a representation of the true char- 
acter as it exists in nature, or seek a repre- 
sentation of our concept of this character, it 
does not seem profitable to enter into extended 
discussion. We grant the possibility that our 
concept, the nearest corresponding actual 
character, and the aspects of phenomena 
which we actually measure, may all be differ- 
ent. Further, it may reasonably be possible 
for the measurements to be closer to the ac- 
tual character than to our concept, or vice 
versa. These discrepancies may arise through 
lack of complete fidelity in our understanding, 
or in our procedures. Our concepts are fal- 
lible; and they may be in part arbitrary, as 
previously pointed out. Our procedures are 
also subject to considerable error. If we are 




concerned with securing measurements that 
are useful for general purposes, such facts 
need not disturb us. If, on the other hand, 
we are seeking to describe nature as accu- 
rately as possible, we are forced into the field 
of metaphysics, and we must admit that we 
cannot be certain. Granting the subjective 
element in all of our knowledge, we concede 
the point and proceed. 

While many characteristics we desire to 
measure are satisfactorily represented in the 
natural state of objects which we observe, 
many others are not. For example, we can 
readily measure the length of an ordinary table,
but the diameter of the earth presents special
problems. We cannot apply a linear scale
directly to it. The diameter of an atom, or 
of the nucleus in the atom, presents still other 
problems; our senses would never reveal to us 
that there were atoms which might have diam- 
eters. Again, we cannot “put a thermometer 
in the mouth of the sun,” nor can we place a 
star on the platform of a scale to weigh it. 
It is among characters and objects that are not 
susceptible of direct perception or ready 
manipulation that the great triumphs of mod- 
ern science are to be found. To index magni- 
tudes under these difficult conditions has called 
for the greatest ingenuity, and the problems in 
these fields constitute one of the most fasci- 
nating areas of research. 

Difficulties of observation may be classed 
as of three kinds: (1) difficulties growing out 
of the limitations of human perception; (2) 
difficulties growing out of the location of the 
observer (with reference to the phenomenon) ; 
and (3) difficulties growing out of the use or 
application of ordinary measuring instruments 
and scales. To overcome these difficulties of 
observation, which occur both singly and in 
combination, it is necessary to resort to in- 
direct measurement. The most common 
method of indirect measurement has been to 
utilize what may be called functional repre- 
sentation. We shall mention five types or 
media of functional representation by which 
we may extend our immediate senses or make 
possible the use of convenient scales. 

Perhaps first of all we should list the mathe- 
matical functions, because they enter into or 
underlie all of the others. They may be il- 
lustrated by the manipulations of direct or in- 
direct optical perception in order to relate the 
observations properly to ordinary scales, as 
when one calculates the height of a tree, or the 
length of a pond, or the diameter of the earth, 




by utilizing certain mathematical functions of
the observations which he takes. Mathe- 
matical relationships will be found to be a part 
of all indirect measurements, and we may note 
that they underlie the construction of all in- 
struments and equal-unit scales which are pre- 
pared. They are given separate mention here 
because of the fact that they may occur apart 
from the use of any special instrument. 
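The height of a tree affords a simple instance of such a mathematical function. The sketch below assumes one common procedure, not one described in the article: the observer measures his horizontal distance from the tree and the angle of elevation to its top, and adds the height of his eye above the ground. The function name and the sample figures are hypothetical.

```python
import math


def height_from_angle(distance, elevation_deg, eye_height=0.0):
    """Indirect height: horizontal distance times the tangent of the
    angle of elevation, plus the observer's eye height."""
    if not 0.0 <= elevation_deg < 90.0:
        raise ValueError("angle of elevation must lie in [0, 90) degrees")
    return distance * math.tan(math.radians(elevation_deg)) + eye_height


# Hypothetical sighting: 20 units from the trunk, top seen at 45 degrees,
# eye 1.5 units above the ground.
print(height_from_angle(20.0, 45.0, 1.5))
```

The quantities actually compared with a scale are a ground distance and an angle; the height itself is never touched by a measuring rod.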

As the second and third types of functional 
representation, we shall mention physical and 
chemical functions. The two are referred to 
together because the boundary line between 
them is so indefinite, and because they so fre- 
quently occur in combination. As simple ex- 
amples of the physical functions, we may note 
the use of lenses and prisms to bend light 
rays, the use of machines to change the direc- 
tion of force, and to change the ratio of the 
force-distance components of work. It is also 
worth observing that physical transformations 
are normally an integral element of our meas- 
uring instruments. When these processes are 
employed as steps in measurement, it is obvi- 
ous that the results are based on these physical 
functions of the original characters. In the 
field of chemistry, we may refer to the various 
steps incident to quantitative analysis by 
which various substances are precipitated or 
otherwise put in such form that they are ap- 
propriate for measurement. Or we may con- 
sider all of those refined and highly technical 
processes by which valence is measured, 
atomic weight and atomic numbers derived, 
etc. 

Except in the simplest instances, in the phys- 
ical sciences as well as in others, measurement 
of a character as it is exhibited by some object 
or set of objects is not performed directly up- 
on the objects as they normally exist, but is 
dependent upon transformations that are be- 
lieved to be reasonably accurate. As a prac- 
tical illustration of the sequences of physical 
and chemical transformations which may oc- 
cur in measuring, we may refer to the method 
used during the World War to detect and 
locate the position of enemy guns. A hot- 
wire Helmholtz resonator was used, tuned so 
as to respond only to the appropriate sound 
waves. The oscillating air currents set up by 
these particular sound waves, passing in and 
out of the neck of the resonator, cooled a wire 
grid located there and heated by electricity. 
The cooling changed the electrical resistance 
of the grid, which was measured by a Wheat- 
stone bridge, and the result was photographed. 
















Readings of the varying electrical resistance 
as shown on the photograph indicated the in- 
tensity of the sound waves, and from similar 
readings in a number of places, the source of 
the sound could be located by mathematical 
calculations with considerable accuracy. This 
same device is regularly used in laboratories 
for many purposes. It is interesting to note 
the steps by which the desired characters (dis- 
tance and direction) are successively ap- 
proached by converting the variables which 
could be secured under the restricted condi- 
tions of observation first into one form and 
then another until they finally yield measure- 
ments of the kind which are sought. 

It quickly becomes clear from a study of 
actual measurement practices that the variable 
which is ultimately compared directly with a 
scale is frequently far removed in nature from 
the original character one desires to study. 
Thus, to measure the temperature of stars so 
distant that no heat from them can be felt, 
one studies the variation in the intensity of 
different wave lengths revealed by the spectro- 
scope. To ascertain the composition of the 
incandescent solid or liquid interior of stars, as 
well as the gases in the atmosphere surround- 
ing them, one notes the distribution of colors 
and lines in a spectrum. Or to determine the 
speed with which a star is moving toward or 
away from the earth, no timepiece or method 
of measuring distance need be employed,—one 
studies the slight displacement of certain spec- 
tral lines. These spectra are often photo- 
graphed, so that one finds himself measuring 
the velocity with which some distant star is 
speeding off into space, by looking at an inert 
picture lying in his hands. 
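The conversion from a displaced spectral line to a velocity may be sketched as follows. This is a simplified illustration, not the astronomer's full procedure: it uses the first-order (non-relativistic) Doppler relation, in which velocity equals the speed of light times the fractional change in wavelength, and the function name and the observed wavelength are hypothetical. The rest wavelength is that of the red hydrogen line, about 656.28 nanometers.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # kilometers per second


def radial_velocity_km_s(observed_nm, rest_nm):
    """First-order Doppler estimate of velocity along the line of sight.

    A positive result means the line is shifted toward longer wavelengths,
    i.e. the source is receding; a negative result means it is approaching.
    """
    if rest_nm <= 0:
        raise ValueError("rest wavelength must be positive")
    return SPEED_OF_LIGHT_KM_S * (observed_nm - rest_nm) / rest_nm


# Hypothetical measurement of the red hydrogen line on a photographed spectrum.
velocity = radial_velocity_km_s(observed_nm=656.500, rest_nm=656.281)
print(round(velocity, 1))
```

The only quantity read from the photograph is a small displacement along the plate; the velocity follows from it through the functional relation alone.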

The fields of physics and chemistry, with 
their many ramifications and specialized ap- 
plications, are replete with examples of meas- 
urement through the functional transformation 
of one kind of variation into another, and fur- 
ther illustrations in these areas need not be 
offered. It may be noted that the changes 
which are effected by these transformations 
are of three kinds: they may be modifications 
of stimuli which will directly augment ordi- 
nary perception (as bending light rays); they 
may be changes in the form of the object 
which is to be observed (as chemical precipita- 
tion); or they may be changes in the form of 
the character itself (as when light or mechani- 
cal energy is converted into electricity, or vice 
versa). The relation of these changes to the 
general measurement process must be kept in
mind: the direct measurement or scaling of
the final form of the variable is taken as an in- 
dex of the degree of the original character, 
through the agency of various functional 
media or reagents, and in this indirect way the 
original character is itself measured. 

The physical sciences are, of course, not the 
only ones in which the influence of one vari- 
able upon another is utilized as a means of 
measurement. We may mention biological 
reactions as constituting our fourth class of 
functional representations of characters to be 
measured. In the field of biology we find sub- 
stances or characters which can be satisfac- 
torily studied only through noting their influ- 
ence upon biological organisms. For example, 
such substances as vitamins, enzymes, and 
hormones—of fundamental importance to life 
and health in animals and human beings— 
have so far generally defied chemical analysis, 
and their potency is indicated only by physio- 
logical assay. That is, for example, the vita- 
min strength of many substances which are 
commercially produced is regularly tested 
through the effect of these substances on the 
health of rats when introduced into their diet. 
The strength of insulin is standardized by ad- 
ministration to rabbits, one unit of potency 
being defined as one third of the amount which 
will lower the blood sugar of a rabbit of av-
erage size to an average per cent of .045 for
a period of five hours after its injection. One 
“dog unit” of extract of parathyroid gland is 
the amount which will produce an increase of 
5 milligrams in the blood serum calcium of 
dogs weighing about 20 kilograms, in about 
sixteen hours. Similarly, medicines, such as 
digitalis, serums, and antitoxins are, in com- 
mercial practice, tested for strength by admin- 
istration to cats, guinea pigs, and other ani- 
mals. If we should trace through the com- 
plex physiological reactions that take place in 
these reagent organisms we should find a chain 
of complicated processes of various kinds, just 
as we find in the case of the more artificially 
arranged and mechanically controlled series of 
reactions in the physical fields. 

Such biological tests should not be regarded 
as makeshifts or substitutes. The balancing 
of the amount of one of these complex sub- 
stances against a given physiological reaction 
is as direct as the process of titration in chem- 
istry. Physiological reaction will in fact rank 
with the very best of chemical indicators; a 
few milligrams of a vitamin will make the dif-
ference between health and death; the human
body will react to pituitrin when in a dilution
of one part in one-hundred-million, and less 
than one one-hundred-millionth of an ounce of 
epinephrin will affect the heart and blood ves- 
sels. If some allowance has to be made for 
variations in the reagent organism, the same 
thing must be done in the physical fields, and 
any difference in practice is one of degree, 
and not of kind. If the complex substances 
in question are ever specified adequately in 
physical and chemical terms, we should re- 
gard those measurements as the indirect ones. 
When the stimulus strength of a substance or
force is sought, a biological reaction in the
normal, organic, vital situs and role estab- 
lished by nature is the appropriate functional 
representation of the original character, just 
as a physical transformation is appropriate 
for a character which is primarily physical in 
its properties. 


It is appropriate that we also recognize the 
type of reagents which are called for in the 
measurement of characters that are essentially 
psychological. These will constitute our fifth 
and final class of functional representations. 
While some characters that are primarily psy- 
chological in their significance can be de- 
scribed (measured) in large part in physical 
terms, this becomes possible either because a 
psychological reagent has established a sub- 
jective scale which furnishes standards for the 
mechanical scale, or because some psychologi- 
cal reagent serves directly as the indexing 
medium. As an example of preliminary sub- 
jective analysis and scaling, we may mention 
the field of color. Hue, brilliance, and satura- 
tion, as the determinants of color, are expres- 
sible uniquely in physical terms (wave length 
and energy); but they were originally isolated 
and described by subjective processes. That 
is, red, as a hue, is identified psychologically, 
and what shall be regarded as a variety of red 
on the one hand, or as a variety of red-orange 
or some other color on the other hand, is a 
subjective matter. When once these cate- 
gories are established, they can be physically 
expressed; the determinations which give sys- 
tem and significance to the physical measure- 
ments were, however, made by the human be- 
ing as a competent and efficient psychological 
reagent. 


In the field of sound, we find a more pro- 
nounced disparity between the physical and 
psychological correlates. In physical terms, 
the three aspects of single sounds are intensity, 
frequency, and wave form; the corresponding
psychological characters are loudness,
pitch, and quality. But loudness does not 
vary as intensity; and while a functional re- 
lation between them has been established in 
terms of Weber’s law (Psychology) it must 
be borne in mind that there is no direct meas- 
urement of loudness except through the hu- 
man being. Pitch can be well described in 
physical terms, but harmony is a character of 
peculiarly human significance, varying with 
the individual and with the setting, and it is 
difficult, if not impossible, of independent
prognostication in terms of physical 
properties. 
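
The relation alluded to may be set down in symbols. What follows is a sketch of the law as it is commonly stated in psychophysics, not a formula taken from this paper. The symbols are introduced here for illustration: $I$ is the physical intensity, $\Delta I$ the just noticeable increment, $S$ the sensation (here loudness), $I_0$ the threshold intensity, and $k$ and $c$ are constants. The first line is Weber's law; the second is the logarithmic form usually attached to it, which rests on Fechner's added assumption that each just noticeable increment adds an equal step of sensation.

```latex
\frac{\Delta I}{I} = k
\\[6pt]
S = c \, \log \frac{I}{I_0}
```

Under the second form, doubling the intensity adds a constant amount to the loudness instead of doubling it, which is the sense in which loudness does not vary as intensity.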

Quality of tone can be expressed in physi- 
cal terms, and made visual, but the human be- 
ing is the reagent that decides finally what is 
good quality and what is not. For example, 
a mechanical test was recently devised for 
measuring the efficiency and tone quality of 
violins. The test was found to agree with the 
judgments of expert players, and it could 
therefore be substituted as a more stable, more 
carefully graduated, and more readily avail- 
able measure. If however the test had not 
agreed with the experience of the musicians, 
the test would have had to be altered or aban- 
doned as invalid. 


In many instances, a psychological reagent 
is regularly employed to make a character 
sensible because there is no other means pos- 
sible. As an example we may consider one of 
the classical methods in psychology for study- 
ing degrees and kinds of difficulty, namely, 
maze running. There are no scales which 
can be applied directly to a maze to deter- 
mine its degree of difficulty; so recourse must 
be taken to some form of functional repre- 
sentation. The character to be indexed is es- 
sentially psychological, and hence a psycho- 
logical reagent must be employed. The real 
measure of the difficulty of a maze lies in the 
difficulty of certain psychological reactions, 
notably, the resolving of the totality of likely 
responses into a set of efficient responses. 
But the psychological reactions cannot them- 
selves be perceived directly, so that length of 
time, or some other perceptible aspect of be- 
havior, must be taken as an index of them, 
and observations are normally recorded in or- 
dinary units of time, errors, number of trials, 
etc. The indirectness of such measurement 
lies in the use of overt behavior to index cor- 
tical and neural activity, but this substitution 
is less indirect than is common in physical 
fields. The employment of a psychological
reagent for the measurement of difficulty is
just as appropriate as is the use of a scale for 
measuring distance. 

In certain other areas we must also depend 
upon psychological reagents—perhaps with a 
certain amount of introspection—to represent 
the character we desire to measure. This is 
generally true in the field of human values, 
which include such concepts as quality, ap- 
propriateness, preference, etc., and spread out 
over the whole of life to embrace physical as 
well as moral, esthetic, and other philosophi- 
cal elements. In some cases — not in all — 
there are correlated indices that can be ex- 
pressed in physical terms. For example, cer- 
tain elements of quality in physical objects 
can be indexed physically, when these ele- 
ments have been subjectively established, and 
are amenable to such treatment. In the mat- 
ter of total satisfaction with a complex ob- 
ject, however, such as with an automobile, it 
is difficult, if not impossible, to represent the 
general character by detailed measurements 
of all of the factors that enter into it. 

The difficulty of such physical indexing 
lies partly in the fact that we do not know all 
of the individual physical elements of quality, 
but it lies even more largely in the fact that 
value does not attach to such elements in 
isolation, but in pattern, and is a very com- 
plex function so that its ultimate expression 
in physical terms is difficult. In the field of 
art, for example, only some of the grosser as- 
pects can be represented by physical meas- 
urement with any success. One can of course 
measure any particular picture in as great 
mechanical detail as he pleases, but such 
measurements afford little suggestion as to 
the esthetic value of such a picture. In the 
main the value of esthetic phenomena must 
be determined by human beings reacting di- 
rectly to the objects.* 

A number of judgments are usually sought 
in order to eliminate erratic individual bias, 
so that we usually have several judges to 
award the prize at an art exhibit, to determine 
the winning debate team, etc. Boards of di- 
rectors as a group decide upon the relative 
desirability of various policies, and legisla- 
tures estimate the voting strength of various 
groups that favor or oppose certain actions. 


* There are, of course, certain procedures in judging that
are designed to help one arrive at a true expression of his val-
ues. These procedures will be found reviewed, and the litera-
ture relating to them cited, in The Methodology of Educa-
tional Research, by Carter V. Good, A. S. Barr, and Douglas
E. Scates, p. 409-439. New York: D. Appleton-Century
Co., 1936. 882p.




Juries weigh the evidence concerning the guilt 
of a certain defendant, and judges pronounce 
sentences representing (perhaps crudely, but 
nevertheless formally) the degree of wrong- 
ness which society attaches to various of- 
fenses. While these examples do not nor- 
mally represent scientific work, they illustrate 
types of situations in which dependence must 
be placed upon a human reagent to represent 
the character to be indexed, and in many of 
these instances the values involved are large. 

This discussion of psychological reagents 
should make it clear that, while physical de- 
terminants of certain psychological reactions 
may be ascertained so that they can be physi- 
cally identified when they recur, such physical 
measurements are made possible only through 
the establishing of a relationship between the 
physical units and the psychological values, 
which can be accomplished initially only 
through the identification of values by a psy- 
chological reagent. There are, on the other 
hand, many psychological reactions for which 
the physical stimuli are so complex, and so 
dependent upon subtle elements of pattern, 
that any attempt to account for the reaction 
by objective measurements of conditions is of 
little avail. In still other areas of human 
value, such as abstract moral, ethical, and re- 
ligious standards, there may be no physical 
counterpart which enters immediately into the 
consideration, and one must depend wholly 
upon human reactions for indexing the 
strength of the variable. In any case where 
psychological reactions are concerned, it must 
be borne in mind that the physical measure-
ments are the indirect ones for the purpose of
indexing strength or quality of reaction, and 
they necessarily lose their precision and their 
significance whenever the psychological values 
shift, as they do from time to time, from in- 
dividual to individual, or from one setting to 
another. Where a psychological reagent is 
required to represent the variable to be in- 
dexed, any attempt to substitute mechanical 
(objective) measurements which have not 
been psychologically calibrated may, in spite 
of their apparent precision, result in measure- 
ments which are unbelievably crude. 

This discussion has dealt at some length 
with measurements made through the instru-
mentality of various media of functional rep-
resentation. It has been noted that in each
instance the variable finally depended upon
(the one directly observed) was different from 
the original character to be measured, and
that in some cases there were a number of
steps, or transformations, involved. We have 
not assumed that all of these functional rep- 
resentations are equally precise; modern 
scientific thinking, however, recognizes many
degrees of functionality, from exact mathe- 
matical functions down to a complete absence 
of correlation. It is possible that the five 
media discussed—mathematical, physical, 
chemical, biological, and psychological—rep- 
resent something of a descending order of ex- 
actness. An awareness of this tendency does 
not, however, lead to the conclusion that the
less exact representations are unsuited for
scientific purposes; it rather lays emphasis
upon the necessity for utilizing proper pro- 
cedures of work and of interpretation so that 
false conclusions are not drawn. With the 
rapid rise of statistical methods in recent 
years, the understanding of loose functions 
has itself become something of a science, and 
much may be done in areas today that would 
have been impossible a generation or two ago. 
It may be said that, in general, there is much 
more to be hoped for from employing appro- 
priate reagents, even with a recognized lack 
of precision and fixity, than there is from em- 
ploying inappropriate reagents where the pre- 
cision is only a superficial aspect and the 
gross lack of validity lies hidden and over- 
looked. 

It appears that there are still other ways 
of securing representation of characters that 
are difficult of direct measurement. We have 
been giving attention to functional representa- 
tions, which index a variable by utilizing its 
effects upon other variables; we may now 
give attention to methods which seek to in- 
dex a variable by studying the behavior of 
elements which compose it. It may be appro- 
priate to refer to these methods as analytical 
representation, in contrast to the functional 
representation which has just been discussed, 
since they are concerned with the more de- 
tailed elements of the general character. The 
attacks about to be examined serve two some- 
what different purposes: they may be di- 
rected toward the description of the general 
character through representing its compo- 
nents, or they may be directed toward the de- 
scription of the components themselves 
through breaking down the general character. 
Each of these two possibilities must be kept 
in mind. 

In the early part of this paper, attention 
was called to the importance of exact definition
of complex characters. It was pointed
out that such definition would make possible 
the measurement of these characters through 
measuring their constituents, as is done quite 
commonly in the physical sciences. Unfortu- 
nately, such definitions are not normally 
available in the social sciences, and so when 
we seek to index variations in the general 
concept by measuring its components, we are 
usually forced to resort to some sort of 
sampling procedure. This is done in vari- 
ous ways. One may for example content 
himself with the measurement of what he re- 
gards as an outstanding component, and not 
go further. Examples of this are common in 
our research literature, as when one uses 
length of time as the index of difficulty in 
learning nonsense syllables or in maze run- 
ning, whereas in reality difficulty has many 
aspects and many manifestations. One may 
use temperature as an index of weather, value 
of real estate as an index of ability to pay 
taxes, years of training as an index of merit 
for determining teachers’ salaries, etc. The 
value of this method of representation de- 
pends upon the extent to which the compo- 
nent which is used fluctuates in accordance 
with the general complex variable; it may, or 
may not, be satisfactory. 
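
Where some independent estimate of the general variable is to be had, the extent to which a single component fluctuates with it may be examined by correlation. The following Python sketch is illustrative only: the figures are invented, and the names `pearson_r`, `minutes_to_learn`, and `judged_difficulty` are assumptions introduced here, not part of the original discussion.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Invented data for six mazes: one component (time) and a separate
# estimate of the general character (difficulty as rated by judges).
minutes_to_learn = [4.0, 6.5, 7.0, 9.5, 12.0, 15.0]
judged_difficulty = [1, 2, 2, 3, 4, 5]

r = pearson_r(minutes_to_learn, judged_difficulty)
print(f"correlation of component with general character: {r:.3f}")
```

A coefficient near unity would suggest that the component may stand for the general variable in the material studied; a low coefficient would suggest that it may not.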

A second type of analytical representation 
consists in measuring various aspects of a gen- 
eral character and using these measures indi- 
vidually, with no attempt to consolidate them 
into a single index. For certain purposes, 
this is the most serviceable thing to do; it 
yields analytical results which are explicitly 
descriptive. In the case of personality meas- 
urements, for example, a profile is ordinarily 
more significant than a composite general rat- 
ing. Or, referring to a fairly new field, we 
may cite the measurement of a person’s abil- 
ity to drive an automobile, which is now meas- 
urable in many of its components, with known 
critical areas in each.* Defects or deficien- 
cies discovered in certain areas of driving abil- 
ity might possibly be remedied. When the 
elements of a composite are individually of 
definite significance, measures of them may 
be more valuable than a single figure for the 
more general concept which embraces them. 

As a third procedure in analytical represen- 
tation, we may refer to those statistical meth- 
ods which are directed toward throwing light 


* Since this is a new field, and one in which interest is
[illegible], some reference is cited: “Instruments
Measuring Driving Skill,” by Harry R.
DeSilva, [journal title illegible], IX (April, 1936), 101-08. Bibliography.







on the constituents of a complex variable, 
and measuring their relative importance. 
These methods are to be regarded as supple- 
mentary to experimental methods in which 
direct measurement of elements is possible. 
They generally assume that a satisfactory 
measure of the general character is available, 
and that all that is wanted is a proper anal- 
ysis of this. For such a purpose multiple re- 
gression is a common method. Assuming 
that the inter-relationships are all linear, that 
constant weights will suffice, and that the ap- 
propriate combining function is an additive 
one, this method will give optimum weights 
for the components. Closely related to mul- 
tiple regression is the analysis of variance, 
which also furnishes weights (measures of 
relative influence) for factors which contrib- 
ute to the variation of a composite. This 
technique, under certain assumptions as to 
the distribution of elements contributing to 
correlation, makes possible an estimate of the 
relative influence of component factors with- 
out requiring a direct measurement of each 
of them. The technique has been much used 
in agricultural studies. Various forms of fac-
tor analysis† represent another development
of indirect or statistical measurement through 
correlation; these again yield weights for fac- 
tors which can not be measured directly. 
While all of these methods are likely to result 
in certain mathematical artificialities, because 
of assumptions which are necessary, they give 
results which are valuable, and they yield 
measures which would in many cases be en- 
tirely impossible to secure in any other way. 
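
The operation of multiple regression under the assumptions just named (linear relations, constant weights, an additive combining function) may be sketched as follows. This is a minimal illustration in Python, solving the least-squares normal equations directly; it is not the procedure of any particular study. The data are invented so that the true weights are known, and the names `solve` and `regression_weights` are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def regression_weights(components, composite):
    """Least-squares constant term and weights for
    composite = b0 + b1*x1 + b2*x2 + ..."""
    rows = [[1.0] + list(r) for r in components]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, composite))
           for i in range(k)]
    return solve(xtx, xty)


# Invented observations of two components for five cases.
components = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
# General measure constructed as 1 + 2*x1 + 0.5*x2, so the weights
# the method should recover are known in advance.
composite = [1 + 2 * x1 + 0.5 * x2 for x1, x2 in components]

weights = regression_weights(components, composite)
print([round(w, 6) for w in weights])
```

With real observations the general measure would not be an exact function of the components, and the weights obtained would be the best-fitting ones in the least-squares sense only.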

The attempts to combine various compo- 
nent measures so as to provide a measure of 
a general character may be regarded as com- 
prising the fourth type of procedure in 
analytical indexing. In the physical sciences, 
such measures are commonplace, as illustrated 
by such concepts as horsepower, feet per sec-
ond per second, watt hours, etc. The two
principal difficulties in accomplishing similar 
measurements in the biological and social 
sciences lie in securing a clear and compre- 
hensive statement as to what variables are em- 
braced by the general concept and in securing 
a combining function of these variables which 
will parallel the intricate pattern of interrela- 
tionships which frequently exists in natural 
phenomena, or in our concepts of phenomena. 
One course is to add the values of components 


† For examples, consult the works of Spearman, Holzinger,
Kelley, Hotelling, and Thurstone.




directly, either without special weighting or 
with arbitrarily determined weights, such as 
those afforded by making the variabilities 
equal, or by applying any other purely artifi- 
cial set. Multiple regression and index num- 
ber techniques afford a somewhat more elab- 
orate form of addition, and have received 
wide use.† Horst has developed a combining
function to be used when nothing is known 
directly about the weights.** 
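
The simplest of the courses just mentioned, that of adding components after their variabilities have been made equal, may be sketched as follows. The Python below is an illustration under stated assumptions: the scores are invented, the population standard deviation is used as the measure of variability, and the names `standardize` and `equal_variability_composite` are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Express each value as a deviation from the mean in units of
    the (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def equal_variability_composite(*components):
    """Add the components after giving each the same variability."""
    standardized = [standardize(c) for c in components]
    return [sum(case) for case in zip(*standardized)]


# Invented scores of three pupils on two tests with unlike scales.
reading = [10, 20, 30]
arithmetic = [1, 2, 3]

index = equal_variability_composite(reading, arithmetic)
print([round(v, 3) for v in index])
```

Had the raw scores been added directly, the reading test, with its larger scale, would have dominated the sum; equalizing the variabilities is one arbitrary way of preventing this.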

Other statistical functions of the observa- 
tions may, of course, be set up so as to satisfy 
various criteria that may appear to be appro- 
priate to the circumstances and the purpose. 
Thus, we now have measures of the financial 
burden of supporting a proper level of educa- 
tion in various localities, measures of the ef- 
fort being put forth to meet these educational 
needs, and measures of the resulting need for 
aid from a larger base of support.* Many 
other practical measures of this sort have 
been devised, and are in use. 


These various types of analytical represen- 
tation are capable of yielding fair indices of 
what we wish them to—the more so when 
we are certain of what we want. It appears, 
however, that we often depend upon statisti- 
cal results in ways that we should not. That 
is, we use them as though they were adequate 
descriptions of the general concepts we are 
seeking to measure when we might experience 
great difficulty in attempting to make explicit 
either to ourselves or to others just what this 
complex concept is that we think we are meas- 
uring by these methods. It is this situation 
that has given rise to the somewhat facetious, 
but scientifically conservative, statement that, 
“Intelligence is what is measured by intelli-
gence tests.” The error is not so bad (scien-
tifically) when we at least recognize that our
measures may not coincide with any concept 
that we originally held, or have subsequently 
developed. 


The two general classes of representation 
which have been named — functional and 
analytical — probably cover sufficiently well 
the various procedures which are being utilized


† Multiple correlation methods are, of course, familiar to
social scientists, but the index number technique has not
received such wide treatment. The writer has given a
[illegible] of its characteristics and applications in:
“General Nature and Applicability of Index Numbers for
[illegible] Journal of Experimental Education, IV (March

** Paul Horst, “Obtaining a Composite Measure From a
Number of Different Measures of the Same Attribute.”
Psychometrika, 1 (March, 1936), 53-60.

* Federal Support for Public Education, by Paul R. Mort
[illegible]. N. Y.: Teachers College, Columbia University,
1936. [illegible] p.


for measuring characters that cannot be
dealt with directly. They embrace the effects 
of a character upon other variables, and also 
the interrelationships between conceptually 
complex characters and their constituents. 
Some characters are of course already within 
the range of the human senses, ready for di- 
rect measurement. Most of them, however, 
have to be indexed through the associated 
variation of other characters which are or 
may be made sensible. The nature of these 
indices varies with the nature of the phenom-
ena being investigated; representation may be
accomplished through the utilization of any
relationship which is regarded as appropriate
to the data and to the purpose. It is through 
the discovery of means of bringing obscure 
characters within the range of human percep- 
tion that science has been enabled to make 
its greatest advances. It is in this field that 
the ingenuity and skill of the research worker 
have met their greatest test, and it is also 
here that the mind of man has shown its 
greatest triumphs over apparently insuperable 
obstacles. 

We come now to the third and final require- 
ment of measurement. We have stated as 
the first two requirements, a working concept 
of something to be measured, and a satisfac- 
tory representation of this concept, so that 
its degree can be ascertained. The third re- 
quirement which we shall mention is a basis 
of quantitative comparison. In order to 
measure, we must have some quantity with 
which we may compare the degree of the (di- 
rectly or indirectly) observed character. We 
may now examine this concept of a quantita- 
tive standard to see what is necessarily 
involved. 

If we are used to thinking of measurement 
in terms of elementary physics, or in terms of 
ordinary, everyday purposes, we will probably 
think of this quantitative standard as being 
a scale, having various mathematical subdivi- 
sions which afford many reference points for 
comparison, and which permits of a high de- 
gree of refinement. This concept of the 
quantitative criterion is not however the only 
one which is possible. It will be necessary 
to recognize at the outset that measurement 
may have any degree of precision. The 
measurement of distance, for example, may 
vary in refinement from billionths of an
inch, through the range of ordinary require- 
ments, to results that are very crude. The 
procedure and the scales which are used will
vary according to the refinement which is de-
sired, and may be very fine, or very coarse.
The present discussion will emphasize the 
simpler forms of comparison because they 
seem so often to be overlooked when one 
thinks of measurement. 

Measurement is essentially a “more-than” 
or “less-than” type of comparison between a 
reference point (usually a mark on a scale) 
and the phenomenon. This comparison may 
in practice take a variety of forms, and is not 
always direct. Reduced to its simplest terms, 
measurement is the comparison of any two 
quantities or degrees, one of which is taken 
as the reference point or basis of comparison 
for the other. We may therefore have meas- 
urement when we have only a single reference 
point, the resulting measurement being 
dichotomous. That is, an observed degree of 
a character either falls short of the standard, 
or it exceeds the standard; it is quantitatively 
less than the amount represented by the 
standard, or it is quantitatively greater than 
the standard. (We may dismiss cases which 
appear to equal the standard exactly, either 
by assuming that more refined comparison 
would throw them to one side, or by consider- 
ing them as being half above and half below.) 
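
Measurement against a single reference point may be sketched in a few lines. The Python below is an illustration only: the function name `compare_with_standard` and the figures are invented, and cases equal to the standard are counted as half above and half below, which is the second of the two conventions mentioned in the parenthesis.

```python
def compare_with_standard(observations, standard):
    """Return (number above, number below) the standard, counting
    each exact tie as half above and half below."""
    above = sum(1 for x in observations if x > standard)
    below = sum(1 for x in observations if x < standard)
    ties = len(observations) - above - below
    return above + ties / 2, below + ties / 2


# Invented observations compared with a standard of 3.
print(compare_with_standard([1, 2, 3, 3, 5], 3))
```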

If it appears that dichotomies represent 
too crude a scaling to be regarded as measure- 
ment, we need only point out that they are in 
common use in science. Most familiar in the 
social sciences is probably the practice of scor- 
ing a response to an “objective” test item as 
right or wrong, recognizing no other degrees 
of knowledge or ability. We find interesting 
uses of dichotomous measurements however in 
many fields of science. Geologists, for ex- 
ample, may study the size and structure of 
fossil remains of leaves and other flora to
obtain evidence of varying amounts of water 
vapor in the air at different times in the past. 
These biological responses to changes in the 
amount of water vapor may then be construed 
in dichotomous fashion to indicate the occur- 
rence of successive eras which had increased 
or decreased amounts of dampness, and these 
alternations are correlated with other lines of 
evidence to piece together a story of the dis- 
tant past. 

Dichotomies may be used directly, as a 
final form of measurement, but they find their 
widest use as an intermediate form of index- 
ing, in combination with enumeration. Thus, 
in biology, if one is measuring the ability of 
successive generations of amoeba to adapt to
warmer temperatures, the number out of a
standard population which survive at vari- 
ous temperatures may be taken as an index of 
the adjustment. In psychological fields, the 
number of pupils who fail on various items 
of a test indicates the relative difficulty of the
items. Somewhat similarly, but with the di-
rection of the action reversed, the number of
items “passed” by a pupil indicates the degree
he possesses of the ability being measured. 
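
The two uses of enumeration just described may be sketched from a single table of right-and-wrong responses. The Python below is an illustration with invented data; the names `responses`, `item_failures`, and `pupil_scores` are hypothetical.

```python
# Rows are pupils, columns are test items; 1 = passed, 0 = failed.
# The entries are invented for illustration.
responses = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
]


def item_failures(matrix):
    """Number of pupils failing each item: an index of item difficulty."""
    n_items = len(matrix[0])
    return [sum(1 - row[j] for row in matrix) for j in range(n_items)]


def pupil_scores(matrix):
    """Number of items passed by each pupil: an index of ability."""
    return [sum(row) for row in matrix]


print("failures per item:", item_failures(responses))
print("items passed per pupil:", pupil_scores(responses))
```

The same dichotomous entries are counted down the columns in the one case and across the rows in the other, which is the reversal of direction spoken of above.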

In vital statistics, the prevalence of dis- 
ease is not infrequently indicated by the num- 
ber of deaths from the disease, and the gen- 
eral healthiness of a city is measured in terms 
of death rates. “Equally often noticed differ-
ences” is a classical tribute to dichotomies in
psychology. In politics, voting is often a 
matter of one choice out of two, and the num- 
ber of votes favoring a party or candidate is 
commonly taken as an indication of the de- 
gree of fervor which was felt by the indi- 
vidual voter. In economic statistics, it has 
been found that the trend of the stock market 
can be represented with great fidelity—and a 
certain added significance—by counting each 
day the excess of the number of stocks which 
changed in price one way over the number 
which changed the other way. 
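
The stock market index just described may be sketched as follows. The Python below is an illustration only: the prices are invented, the names `net_advances` and `closing_prices` are hypothetical, and the running total of the daily figures is an added assumption about how such counts might be turned into a trend line, not a description of any published index.

```python
from itertools import accumulate


def net_advances(previous, current):
    """Stocks that rose minus stocks that fell between two days;
    unchanged stocks are not counted."""
    up = sum(1 for p, c in zip(previous, current) if c > p)
    down = sum(1 for p, c in zip(previous, current) if c < p)
    return up - down


# Invented closing prices of four stocks on four successive days.
closing_prices = [
    [10, 20, 30, 40],
    [11, 19, 31, 40],
    [12, 18, 32, 41],
    [11, 17, 31, 42],
]

daily = [net_advances(a, b)
         for a, b in zip(closing_prices, closing_prices[1:])]
trend = list(accumulate(daily))
print("daily excess of advances:", daily)
print("running total:", trend)
```

Each stock contributes only the direction of its change and nothing of its amount, so that the index is built wholly from dichotomies.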

This second use of dichotomies in measure- 
ment is particularly noteworthy because of 
its extensive employment. It combines enu- 
meration with a dichotomous expression of 
condition, the condition in each individual 
case representing the relative strength of the 
character being indexed in comparison with 
the set of opposing forces. Being interested 
in the general or average strength of the char- 
acter under observation, we count the number 
of cases exhibiting one or the other outcome, 
and draw conclusions. The premise under- 
lying this form of measurement is that, the 
larger the proportion of cases exhibiting a 
given condition, the stronger the forces pro- 
ducing that condition must be, for they have 
been able to predominate in a greater number 
of instances over opposing forces which are 
present in the different cases in varying de- 
grees, and which produce the opposite or con- 
trary condition. We assume that these 
“cases” are complex, variable objects. Even 
if they should represent the contrary forces 
in equal instead of varying amounts, however, 
they still can be used as a dichotomous index 
to distinguish between forces which are able 
to affect all of them and those which affect 
none of them: but for all practical purposes 




cases can be found which will vary satisfac- 
torily. If different cases (populations) are 
used in a comparison, assumptions of samp- 
ling enter; the groups must represent the 
same characteristics in the same degree. 

We may note specifically the case of index- 
ing knowledge or ability with “objective” test 
items. Assuming that the construction of the 
test has been completed, the scalar variable is 
not the proportion of organisms (cases) which 
respond, but rather the proportion of the ob-
jects to which an individual organism re-
sponds with a certain type of response. The
individual responding satisfactorily to the 
larger proportion of test items is deemed to 
exhibit the stronger force or ability. The 
reasoning here is the same as in the preced- 
ing cases; but a slight shift in the conditions 
has been made. We must assume that the 
test items are complex and demand varying 
degrees of ability—at least for any individual. 
If the test items are equally saturated with 
the ability demanded—i.e., if they are equally 
difficult, on the average, for a group of indi- 
viduals—then the type of ability which they 
will index satisfactorily is purely additive in 
its variation. Ordinarily, little can be said 
about the magnitude of differences in strength 
which different proportions indicate, either in 
testing or in the more general case; yet if the 
test items (or the responding organisms, in 
the more general case) are known to be pres- 
ent according to an evenly spaced distribution 
of strength, more detailed conclusions can be 
drawn from the differences in proportion. 
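
The two directions of counting described above can be illustrated with a minimal sketch. The pass/fail record below is invented for illustration, and the names `proportion_passed` and `proportion_failing` are hypothetical helpers, not terms used in the paper.

```python
# Hypothetical pass/fail record: one row per pupil, one column per test item
# (1 = passed, 0 = failed). The data are invented for illustration only.
responses = {
    "pupil_A": [1, 1, 1, 0],
    "pupil_B": [1, 1, 0, 0],
    "pupil_C": [1, 0, 0, 0],
}


def proportion_passed(marks):
    """Index of a pupil's ability: share of the items he passes."""
    return sum(marks) / len(marks)


def proportion_failing(records, item):
    """Index of an item's difficulty: share of the pupils who fail it."""
    marks = [row[item] for row in records.values()]
    return 1 - sum(marks) / len(marks)


n_items = len(next(iter(responses.values())))
ability_index = {p: proportion_passed(m) for p, m in responses.items()}
difficulty_index = [proportion_failing(responses, i) for i in range(n_items)]
```

The same table of dichotomous outcomes is simply counted by rows for the pupil and by columns for the item; neither index says anything about the size of the differences between successive proportions.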

A third application of dichotomous meas- 
urement occurs in the case of ranks. Ranks 
represent a series of dichotomous compari- 
sons, in place of a single one. That is, in 
ranking a group of children on the basis of 
height, each child momentarily becomes the 
standard of quantitative comparison for the 
one being placed. If the character is di- 
rectly perceptible, as height, this comparison 
can be effected directly; if the character is not 
directly perceptible, as knowledge or ability, 
some representation of the character will first 
have to be secured, as by means of a test. 
The same sort of comparison is involved even 
if the ranking is based on a series of refined 
measurements. A series of ranks is in any 
case essentially a systematic succession of 
dichotomous comparisons. We recognize,


however, that, because of its systematic char- 
acter, certain other aspects are also present; 
principally, each case is not only greater than 




the next one, but it is at the same time less 
than the one on the other side. Also the 
series as a whole represents more than a single 
reference point: it contains one for each 
comparison. 
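
A short sketch may make the point concrete: a rank can be obtained from nothing more than a count of dichotomous comparisons. The heights and the function names below are assumptions introduced for illustration.

```python
# Invented heights, used only to decide each pairwise comparison.
heights_cm = {"Ann": 132, "Ben": 140, "Cal": 128, "Dee": 136}


def taller(x, y):
    """One dichotomous comparison: is child x taller than child y?"""
    return heights_cm[x] > heights_cm[y]


def rank_by_comparisons(names):
    """Rank 1 is the tallest; a child's rank is one more than the number
    of other children who win the comparison against him."""
    return {
        n: 1 + sum(taller(m, n) for m in names if m != n)
        for n in names
    }


ranks = rank_by_comparisons(list(heights_cm))
```

The heights enter only through the yes-or-no outcome of `taller`; the resulting ranks would be unchanged under any other set of values that preserved the same order.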

The wide use of dichotomous indications 
for measurement purposes is the more inter- 
esting when it is noted that they do not re- 
quire that the quantity represented by the 
standard of comparison be known. For ex- 
ample, we do not know directly the adjust- 
ment difficulty of a given temperature for 
amoeba, the difficulty of individual test items, 
the amount of water vapor in the air during 
some geologic era, the healthiness of an indi- 
vidual who has died, the absolute value of an 
esthetic or economic or other preference item, 
the strength of conviction of an individual 
voter, etc. All we know is that in certain 
cases one set of factors is stronger than the 
other set, one (either) set being taken as the 
standard. One’s concept of the standard is 
in many instances limited to the knowledge 
that in a certain proportion of the cases it is 
less than or greater than some other force. 

This brings us to the crux of our consid- 
eration of quantitative comparisons. If 
measurement is not (necessarily) a compari- 
son of an observed degree with a known and 
standardized quantity, expressed in terms of 
so many units, what significance does meas- 
urement have? The answer is that the sig- 
nificance of certain indicated degrees may 
grow out of direct personal experience with 
certain reference points, rather than out of a 
scale based upon equal units. Consider for 
example the somewhat homely, but neverthe- 
less pointed, illustration of a river dweller 
marking on a convenient tree the crest of vari- 
ous river floods. If this person had lived for 
a long time near Cincinnati, his highest mark 
previous to 1937 would be for the flood of 
1884. A little lower than that would be a 
mark for the flood of 1913; several feet be- 
low that would be the floods of 1907 and 
1933, and so on. Suppose each spring as the 
river rises, this person would watch its ap- 
proach toward one of these flood stages; and 
as it approached one or another of them, he 
would recall the preceding flood of that level, 
what it meant to him and his neighbors, the 
inconvenience, the suffering, the loss of prop- 
erty, the expanse of water across the river 
valley, etc. Such a person might live his en- 
tire life without ever knowing how high one 
of these flood marks was in terms of feet; 


and yet each of these marks might have much 
greater significance to him than the fact 
printed in the newspaper that the river was 
so many feet high would have to the average 
citizen who might not go near the river once 
in several months. 

Furthermore, the experience that is neces- 
sary in order that arbitrary reference points 
may take on significance does not require ex- 
perience with each possible quantity. It is 
frequently enough that one know what special 
significance attaches to certain critical de- 
grees, these degrees being evidenced in a va- 
riety of ways. A common test of hearing is 
the ability to hear a watch tick at a given dis- 
tance from the ear. It is unimportant that 
we do not know how many units of intensity 
there are in the tick of the average watch at 
that distance—all we need to know is that 
most people can or cannot hear it at a given 
distance. If we are measuring hearing with a 
4-A Audiometer and find that a given group 
of children all test normal, this finding is none 
the less significant because we may not know 
what “normal” is in terms of units of intens-
ity. Or, if we find that a child has a hearing
loss of fifty decibels, how much are we en-
lightened by the statement that a decibel is a
unit for the logarithmic expression of ratios 
of power, and that one decibel is equivalent 
to the loss in electric power in a mile of stand- 
ard cable at 860 cycles? The really signifi- 
cant aspect of a loss of fifty decibels of hear- 
ing is that the child cannot hear well enough 
to learn speech by himself, and will have to 
be taught artificial speech. It appears that 
experience with quantities which are critical 
for a particular character or situation is, in 
many cases, all that is needed. 

It should be made perfectly clear that the 
biological and social sciences are not the only 
ones that make use of measurement without 
definite mathematical units being involved. 
For example, in mineralogy, to measure the 
hardness of minerals, Mohs’ scale is used; this
scale consists of ten minerals that constitute 
various degrees (not necessarily equal) of 
hardness. These minerals constitute a series 
of arbitrary (though standardized) rank val- 
ues. A specimen that will scratch number 
five (apatite) but fails to scratch number six
(orthoclase) is said to have a hardness of five, 
or between five and six. What such a rating 


means is entirely dependent upon one’s fa- 
miliarity with the properties of these stand- 
ard minerals—but it is satisfactory to those 




who work in this field. In astronomy, when 
using the interferometer to distinguish two 
light sources very close together, as two stars 
which appear as one, the scientist compares 
the position of two sets of interference fringes. 
In physics, to compare the pitch of two tun- 
ing forks which are known to be very close to- 
gether, the forks may be placed at right an- 
gles and a beam of light passed through them 
while vibrating. If they are exactly equal in 
frequency, the beam of light will form a sta-
tionary ellipse; if they differ somewhat, the
light will trace various figures, known as Lis- 
sajous figures, which include a circle and a 
straight line as special forms, depending up- 
on the degree of difference. Somewhat simi- 
larly the cathode-ray oscillograph is used for 
the comparison of very high electrical fre- 
quencies. In industrial work, in the construc- 
tion of machinery subject to great stress, such 
as machine tools, vibration in various parts 
is measured by converting the mechanical im- 
pulses into electrical ones and then into a 
visual periodic flame image, which is read by 
the operator without a scale and interpreted 
subjectively. These illustrations are but 
scattered examples of measurement in physi- 
cal fields which is based on more or less tech- 
nical comparisons with objective or subjective 
standards, and which is carried on without 
the necessity of typical units.* 


It may be appropriate to point out here that 
measurement without units does not neces- 
sarily lack precision. We may take for ex- 
ample a form of electro-limit gage used in the 
manufacture of certain metal products, such 
as wrist-pins for automobile engines. These 
gages are interesting from the standpoint of 
functional representation, for, instead of 
measuring distance directly, they reflect the 
resistance which a varying air gap offers to a 
current of electricity. We are concerned 
here, however, with their scale, which may 
have only a single gage point on it—the limit 
of tolerance. Any piece which causes the 
pointer to go beyond that mark is rejected; 
any variation less than that amount is ac- 
ceptable. These measures may be made to 
within a thousandth of an inch; a single ref- 
erence point does not necessarily mean crude 


*One may be interested in comparing the statements of 
Blackhurst, who insists, apparently without extensive analysis 
or examination, that all measurement involves units, that these 
units must be equal, and that a definite zero point must be 
known. Herbert Blackhurst, “Do We Measure in Education?”
Journal of Educational Research, XXVIII, 273-76. This same 
point of view is expressed by Kirkpatrick in the first reference 
cited in the present paper. 







measurement. These scales are, in practice, 
calibrated in terms of ordinary measurement 
with regular units, but it must be recognized 
that they need not be—they might be con- 
structed directly to fit the most satisfactory 
sized product without the intervention of unit- 
based measurement at any point. 

In considering further the possibility of 
measuring without having the value of the ref- 
erence point expressed in regular units, we 
should recognize that ranks have a certain sig- 
nificance that is wholly independent of scale 
values. Without having any detailed results 
of measurements based on scales, we can say 
that a child having a given percentile rank in 
any character in a particular group occupies a 
corresponding position to another child hav- 
ing the same percentile rank on that character 
in another group. And this relationship holds 
even though scaled measures would indicate a 
considerable disparity between the two. Or, 
if we care to utilize the relation between rank 
positions and scale values, we can count on 
this relationship holding from one group to 
another, within the limits of sampling fluc- 
tuations, so long as there are no factors which 
systematically affect one of the populations 
uniquely. 
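
As an illustration of rank position being independent of scale value, the sketch below computes percentile ranks in two invented groups. The mid-point convention used for tied scores is an assumption; the paper does not specify one.

```python
def percentile_rank(score, group):
    """Percentage of the group falling below the score, counting half of
    any scores tied with it (a common convention, assumed here)."""
    below = sum(s < score for s in group)
    equal = sum(s == score for s in group)
    return 100 * (below + 0.5 * equal) / len(group)


# Invented scores for two groups measured on quite different scale levels.
group_1 = [40, 45, 50, 55, 60]
group_2 = [70, 80, 90, 100, 110]

rank_in_1 = percentile_rank(50, group_1)
rank_in_2 = percentile_rank(90, group_2)
```

Both children stand at the same percentile rank within their own groups, although their scaled scores of 50 and 90 differ considerably.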

We come finally to the question: Why do 
we ordinarily measure in terms of units that 
are equal, and standardized? If a single ref- 
erence point will suffice for many purposes, 
and if the value of reference points in terms 
of regular units need not be known in certain 
cases, why do we not stop there? The an- 
swer is that, for many purposes, measurement 
in terms of standardized units has a number 
of important advantages. First among these 
should be mentioned the fact that this type of 
measurement ordinarily carries greater signifi- 
cance to a larger number of people than do 
reference points which are unequal or which 
require unique experience. Referring again 
to the hypothetical river dweller, we may note 
that the significance of his flood marks was 
limited to him and to those who had shared 
his experiences; and while these may have 
been very vital experiences, the marks could 
not be communicated in a letter, nor could 
they be recorded on paper for posterity. The 
significance that comes with generalized units, 
and the resulting ease of communicability, are 
important characteristics of conventional 


measurement.

These attributes must not however be
thought of as being without limit. Quantita-
tive statements expressed in units do not
necessarily carry the desired amount of sig- 
nificance. Units are not inherently meaning- 
ful. Statements of quantity in any field of 
thought must be supplemented by experience, 
and a first-hand knowledge of critical values 
is important for interpretation in any case.
It must be postulated further that one have 
an understanding of the nature and magni- 
tude of the unit used. Few of us even know 
the value of a grain or a dram in our common 
avoirdupois system of weights, and when we 
mention troy weight or the metric gram, only 
specialists understand. Many units are pe-
culiar to a certain field or vocation, and
carry no meaning whatsoever for those who 
are not familiar with measurement in that 
field. Other units—perhaps common enough 
in our experience—may not be familiar in cer- 
tain applications. For example, people who 
have been used to buying fruit by the dozen, 
or by the peck, do not take readily to buying 
by the pound: not that they are not familiar 
with a pound weight, but the unit in this new 
application is not related to a background of 
experiences and does not have the desired sig- 
nificance. 

It must further be recognized that while 
our entire number scheme can be spanned by 
intellectual generalization, it does not follow 
that values in different ranges of magnitude 
are equally significant to any individual. One 
may be familiar with several dollars but have 
no concept of several billion dollars; one may 
understand a mile through considerable ex- 
perience with it, and at the same time have no 
comprehension either of a light-year (6,000,- 
000,000,000 miles), or of a billionth of an 
inch. We cannot complacently rely upon
even well known units for conveying invari- 
ably a significant story. We are forced to 
conclude that, if in any case greater signifi- 
cance attaches to irregular reference points 
than to formal units, we may be making a 
mistake to use the latter. 

As a second set of important advantages 
growing out of conventional units we may 
point out that they facilitate scientific gen- 
eralizations. They fit into a whole scheme of 
quantitative thinking that embraces not only 
the character being observed, but all other 
characters that have been expressed in mathe- 
matical units. They afford, to a certain ex- 
tent, a common denominator for all quantities 
that can be so expressed, and permit inter- 
comparisons and tests for consistency that go 


to the very heart of science. They are par-
ticularly important in science which is con-
cerned with functional variation, commonly
referred to as “laws.” It is difficult to con-
ceive of any large body of quantitative science
growing up apart from mathematically equal, 
relatively stable units. 

It is only reasonable, however, that here
also we recognize certain limitations. The 
equality of units may, in many cases, be a 
conceptual or mathematical artifact. We de- 
rive our units by mathematical division of a 
more or less arbitrary magnitude. Whether
the resulting “equality” is paralleled by any-
thing in the objective universe is a question.
Quite commonly, a second aspect of the phe-
nomenon being measured, or any function of
the character being measured, proves not to be 
linear with the first aspect or function. Some- 
times, when it suits our convenience to regard 
equal units as unequal, we take some other 
function of them, such as taking logarithms 
of the scale values of a frequency distribution 
so as to bring it into approximately normal 
shape. The conclusion seems inescapable 
that the “equal” units which we commonly 
use — where they are available — are units 
which have been created for the purpose of 
convenience in fitting into a common number 
system, and not because nature is set up in 
that fashion. When using them for purposes 
of our science we should therefore bear in 
mind that we are working with something 
that we have selected to fit into our mental 
schematization of the universe rather than 
something that was devised as a particularly 
apt description of nature. 

One further point should be referred to in 
connection with scales. It is sometimes em- 
phasized that certain systems of measurement 
are seriously handicapped by virtue of lack- 
ing a true zero point for their scale. That 
is, the zero point on the scale does not cor- 
respond to the vanishing point of the charac- 
ter being measured. This lack is particu- 
larly characteristic of the psychological and 
biological measuring systems that are based 
on rank or on normated units, including dich- 
otomies which are enumerated. The issue 
does not appear to be as simple as it is us- 
ually stated. It seems in fact necessary to 
recognize four different kinds of zero points, 
and to consider the uses of each. 


First and most fundamental is the true 


zero of the character, sometimes called the ab- 
solute zero. It represents the lowest possible 







value which the character can approach; that 
is, it is the point at which the character van- 
ishes. In the case of temperature, for exam- 
ple, the absolute zero is located by calculation 
at —273° C, at which point all atomic activ- 
ity presumably ceases. It is such a zero point 
that Thurstone took initial steps to locate for 
intelligence.* A scale which has its zero lo- 
cated at the point where the character van- 
ishes affords certain advantages; it permits 
the calculation of ratios between different 
quantities, and it enables a simpler form of 
statement of the relationship between the 
character and other variables when scientific 
laws are established. It may be questioned, 
however, whether a true zero is of as great 
importance as are equal units; and in any 
case, the true zero is of little importance 
mathematically unless one has equal units, 
established on an independent measuring 
scale. 
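
The dependence of ratios upon a true zero can be shown with a brief sketch. It uses the value of −273° C quoted above (the modern figure is −273.15° C); the two readings and the function name are assumptions chosen for illustration.

```python
# Absolute zero on the Centigrade scale, as quoted in the text.
ABSOLUTE_ZERO_C = -273.0


def to_absolute(celsius):
    """Re-express a Centigrade reading on a scale whose zero is the true zero."""
    return celsius - ABSOLUTE_ZERO_C


c1, c2 = 10.0, 20.0  # two invented readings

# Ratio taken on the scale with an arbitrary zero: looks like "twice as warm".
naive_ratio = c2 / c1

# Ratio taken from the true zero: 293 / 283, only slightly above one.
true_ratio = to_absolute(c2) / to_absolute(c1)
```

The units are equal on both scales, so differences agree; only the ratio changes with the location of the zero.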

It must be recognized that an arbitrary zero 
point is not uncommon. This is illustrated 
by the second type of zero point, which is 
simply a value of zero on the scale that is 
set opposite any arbitrarily selected value of 
the character. The amount of the character 
can go below this zero point, and neither the 
character nor the phenomenon necessarily 
vanishes at the point. Perhaps the simplest 
illustration is found in the ordinary thermom- 
eter scales, where arbitrary zero points are 
used in spite of an absolute zero value being 
available. The zero for the Centigrade scale 
is set at a significant or critical value in the 
case of water, but it is nevertheless arbitrary. 

Other instances of such a zero point are to 
be found in several fields—particularly cer- 
tain aspects of time and of distance, which 
cannot have any absolute or true zero point 
at which they cease to exist; time and space 
extend on indefinitely in an unbroken con- 
tinuum. In the case of the numbering of our 
years in chronological time (A. D. and B. C., 
or in the advent of any ruler or dynasty) the 
origin is arbitrary. Again, in the setting of 
standard time throughout a zone, or in meas- 
uring from sea level, we have origins that are 
relatively artificial, and hence arbitrary, in 
part because the actual conditions are in a 
state of constant flux. Longitude, or distance 
from, is referred to an arbitrary point. 

This second type of zero is of value in lo- 
cating a phenomenon relative to any desired 


*L. L. Thurstone, “An Absolute Zero in Intelligence Meas-
urement,” Psychological Review, XXXV (May, 1928), 175-97.




reference point. It can be applied to any 
scale or character, and often increases the con- 
venience of use. Probably our normated 
scales for intelligence and educational achieve- 
ment should be regarded as having this type 
of zero, though it must be recognized that we 
are not, in practice, nearly so much concerned 
with locating a value relative to zero as we are 
relative to the average. If and when the 
scales are extended to zero, however, the point 
would be this second type of zero. 

A third type of zero is found in ordinary 
measuring of the length of objects, or the 
elapsed time required for something to occur. 
In this case the zero on the scale is made to 
coincide with the zero point of the phenom- 
enon (the object or event being measured). 
That is, the zero of the scale is placed at the 
point where the phenomenon vanishes, or has 
its origin, though the character at large (as 
space, or time) does not vanish. In fact, this 
zero point may be taken at any point on the 
scale and made into a zero point by subtrac- 
tion, as is normally done with time when a 
stop-watch is not used. With reference to 
time and space at large, any such point must 
be regarded as somewhat arbitrary; yet it is 
logically located at the beginning of the phe- 
nomenon—just as an absolute zero is located 
at the beginning of the character—and prob- 
ably should not be called arbitrary. In fact 
one may conceive of this as an absolute zero 
if he considers that the character he is meas- 
uring is an interval of time or of distance, 
rather than time or distance itself. If we 
thus make the interval the variable character, 
it may be thought of as vanishing at the ori- 
gin, thus making the zero point an absolute 
zero. 

The fourth type of zero is involved when 
one measures an interval between two selected 
points on a given phenomenon, neither of 
which corresponds to the vanishing point of 
either the character or the phenomenon. In 
this case, nothing necessarily vanishes at the
zero point of the scale except the interval. 
For example, one may measure the growth 
which occurs between the ages of six and six- 
teen. In this case the amount of growth ob- 


served to have taken place previous to the age 
of six is arbitrarily given a value of zero. 
This type of zero bears the same relation to 
the third type that the second type does to 
the first; the zero point is not taken at the 
beginning of the phenomenon, but at some 
point that may be more significant for the 







immediate purpose. It may be noted that 
both the third and fourth types of zero point 
give rise to ratios (when the units are equal), 
and that the fourth type is available to aca- 
demic and intelligence tests. That is, one 
may say that a child who is at ten years, in 
terms of normated units, has progressed twice 
as far across the interval from six to sixteen 
as the child who is at eight years. 
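
The arithmetic behind this statement can be sketched as follows, assuming equal normated units across the interval. The function name `progress` and its default end points are introduced here only for illustration.

```python
def progress(age, start=6.0, end=16.0):
    """Fraction of the interval from `start` to `end` already traversed,
    the zero being placed arbitrarily at `start` (fourth type of zero)."""
    return (age - start) / (end - start)


# Ratio measured from the zero of the interval: (10 - 6) / (8 - 6).
interval_ratio = progress(10) / progress(8)

# Ratio of the raw ages, which rests on a different zero point.
raw_ratio = 10 / 8
```

The ratio of two belongs to the interval from six to sixteen, not to the ages themselves, whose ratio is 1.25.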

With this discussion of zero points we bring 
the final section of the paper to a close. 
While considering the third essential charac- 
teristic of measurement, namely, a basis of 
quantitative comparison, we have noted that 
there may be a variety of kinds of reference 
points, including various kinds of scales. In 
fact we find many examples of measurement 
apart from conventional scales. For practi- 
cal purposes, the chief value of measurement 
is to provide an element of description (a 
quantitative element) which carries signifi- 
cance; the unit and the scale should be chosen 
accordingly, in the light of the psychological 
background of the user. For scientific pur- 
poses, a scale that is metered to our number 
system—the common denominator of all units 
that are in some sense equal—and which has 
a zero point that logically means zero, is 
called for. We should not, however, be mis- 
led by the obvious advantages of this latter 
type of scale into thinking that, apart from it, 
there can be no measurement. Examples to 
the contrary are too numerous and too widely 
used to be ignored. 




In concluding, we may note that measure-
ment takes a great variety of forms, and is
accomplished by a wide variety of means. 
Probably measurement is most commonly 
thought of as essentially equivalent to com- 
paring some object with a ruler. It has been 
the purpose of this paper to make clear that 
measurement is in many instances a highly 
complex process, and one which challenges the 
best insight and inventive ingenuity of the 
scientist. The requirements of measurement, 
in the abstract, are few—a working concept 
of the character to be measured, a satisfac- 
tory representation of quantitative variations 
in this concept as possessed or exhibited by 
phenomena to be measured, and one or more 
reference quantities to afford a basis for com- 
parison. In their applications, these require- 
ments take varied form in response to a great 
variety of conditions and purposes in differ- 
ent areas of science. Ideally, we should, in 
all cases, have an independent scale for meas- 
uring; units that are mathematically derived, 
permanently fixed, and universally familiar: 
and a zero point that has been located on the 
scale; but these advantages cannot always be 
obtained. Lacking these desiderata in many 
cases, science has found other ways of index- 
ing quantitative variation, and, notwithstand- 
ing the great difficulties encountered in many 
fields in meeting even the three primary re- 
quirements, science has continued to go stead- 
ily forward—ever measuring. 







A SHORT METHOD OF CALCULATING THE STANDARD 
ERROR OF THE DIFFERENCE OF THE MEANS 
OF PAIRED ITEMS 


A. S. Barr 
University of Wisconsin 
and 
C. N. MILLs 
Illinois State Normal University 


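There has been a considerable amount of
discussion in the literature of education rela-
tive to the best method of calculating the
standard error of the difference between two
means. The formula generally used for this
purpose is

$$\sigma_{\text{diff}} = \sqrt{\sigma_{M_1}^2 + \sigma_{M_2}^2 - 2\,r_{12}\,\sigma_{M_1}\sigma_{M_2}} \qquad \text{(I)}$$

The purpose of this paper is to exhibit a
shorter method of making this calculation in
the case of paired items, and to present a
mathematical validation of the method. The
formula for this short method is

$$\sigma_{[M_a - M_b]} = \frac{1}{N}\sqrt{\Sigma(b-a)^2 - \frac{[\Sigma(b-a)]^2}{N}} \qquad \text{(II)}$$

Note: $\Sigma(b-a)$ means the algebraic sum.
The paired items are

    a_1 , b_1
    a_2 , b_2
    a_3 , b_3
     .     .
    a_n , b_n


EXAMPLE

    Items     a          b          (b-a)     (b-a)^2
    1         [illegible in source]   0         0
    2         [illegible in source]
    3         [illegible in source]
    4         5          6          1         1
    5         -1         8          9         81
    6         7          7          0         0
    Total     38         49         11        83
    Mean      6 1/3      8 1/6      1 5/6


Short Method

$$\sigma_{[M_a - M_b]} = \frac{1}{6}\sqrt{83 - \frac{(11)^2}{6}} = 1.32$$


Long Method

$$\sigma_{M_a} = 1.521, \qquad \sigma_{M_b} = .76, \qquad r_{ab} = .496$$

$$\sigma_{[M_a - M_b]} = \sqrt{(1.521)^2 + (.76)^2 - 2(.496)(1.521)(.76)} = 1.32$$


Validation of the Short Method


Given the system of paired items a and b
set up as follows:

    a        b        (b-a)           (b-a)^2
    a_1      b_1      (b_1 - a_1)     (b_1 - a_1)^2
    a_2      b_2      (b_2 - a_2)     (b_2 - a_2)^2
     .        .            .               .
    a_n      b_n      (b_n - a_n)     (b_n - a_n)^2

The standard error of the difference of the
means is

$$\sigma_{[M_a - M_b]} = \sqrt{\sigma_{M_a}^2 + \sigma_{M_b}^2 - 2\,r_{M_a M_b}\,\sigma_{M_a}\sigma_{M_b}} \qquad \text{(III)}$$

$M_a$ is the mean of the a items,
$M_b$ is the mean of the b items,
$r_{M_a M_b}$ is the coefficient of correlation be-
tween the paired items.

Using the a and b items, we know that the
deviation of each item from the mean is

$$X_1 = a_1 - \frac{\Sigma a}{N}, \qquad Y_1 = b_1 - \frac{\Sigma b}{N}$$

$$X_2 = a_2 - \frac{\Sigma a}{N}, \qquad Y_2 = b_2 - \frac{\Sigma b}{N}$$

$$\cdots$$

$$X_n = a_n - \frac{\Sigma a}{N}, \qquad Y_n = b_n - \frac{\Sigma b}{N}$$

The standard deviation of the a items is

$$\sigma_a = \sqrt{\frac{\Sigma X^2}{N}} = \sqrt{\frac{1}{N}\left[\Sigma a^2 - \frac{(\Sigma a)^2}{N}\right]}$$

also

$$\sigma_b = \sqrt{\frac{\Sigma Y^2}{N}} = \sqrt{\frac{1}{N}\left[\Sigma b^2 - \frac{(\Sigma b)^2}{N}\right]}$$

Then

$$\sigma_{M_a} = \frac{\sigma_a}{\sqrt{N}} = \frac{1}{N}\sqrt{\Sigma a^2 - \frac{(\Sigma a)^2}{N}}$$

also

$$\sigma_{M_b} = \frac{\sigma_b}{\sqrt{N}} = \frac{1}{N}\sqrt{\Sigma b^2 - \frac{(\Sigma b)^2}{N}}$$

The coefficient of correlation between the
a and b items is

$$r_{ab} = \frac{\Sigma XY}{N\,\sigma_a \sigma_b} = \frac{\Sigma ab - \dfrac{\Sigma a\,\Sigma b}{N}}{N\,\sigma_a \sigma_b}$$

If the sampling of the a and b items has
been made so as to satisfy the proper criterion
for paired groups*, we may say that

$$r_{M_a M_b} = r_{ab}.$$

* Peters, Charles C., and VanVoorhis, Walter R.: Statis-
tical Procedures and their Mathematical Bases. School of
Education, The Pennsylvania State College, 1935, pp. 139-141.

Substituting the values of $\sigma_{M_a}$, $\sigma_{M_b}$, and
$r_{ab}$ in formula (III), gives

$$\sigma_{[M_a - M_b]} = \left\{ \frac{1}{N^2}\left[\Sigma a^2 - \frac{(\Sigma a)^2}{N}\right] + \frac{1}{N^2}\left[\Sigma b^2 - \frac{(\Sigma b)^2}{N}\right] - \frac{2}{N^2}\left[\Sigma ab - \frac{\Sigma a\,\Sigma b}{N}\right] \right\}^{1/2}$$

Simplifying the above expression, we get

$$\sigma_{[M_a - M_b]} = \frac{1}{N}\left\{ \left[\Sigma a^2 + \Sigma b^2 - 2\,\Sigma ab\right] - \frac{1}{N}\left[(\Sigma a)^2 - 2\,\Sigma a\,\Sigma b + (\Sigma b)^2\right] \right\}^{1/2} \qquad \text{(IV)}$$

But $\Sigma a^2 + \Sigma b^2 - 2\,\Sigma ab = \Sigma(b-a)^2$,
and $(\Sigma a)^2 - 2\,\Sigma a\,\Sigma b + (\Sigma b)^2 = [\Sigma b - \Sigma a]^2 = [\Sigma(b-a)]^2$.
Therefore (IV) reduces to

$$\sigma_{[M_a - M_b]} = \frac{1}{N}\sqrt{\Sigma(b-a)^2 - \frac{[\Sigma(b-a)]^2}{N}}$$

which is identical to formula (II). In order
to indicate successive teaching steps, the short
method formula may first be written in the
following form:

$$\sigma_{[M_a - M_b]} = \frac{\sqrt{\dfrac{\Sigma(b-a)^2}{N} - \left[\dfrac{\Sigma(b-a)}{N}\right]^2}}{\sqrt{N}}$$
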
Conclusion. The short method formula 


offers both a simplified method of calculating 
the standard error of the difference between 
the means of two series of paired items, in ex- 
perimental research, and an easy check upon 
the long method when that is employed. 
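
The agreement of the two methods can be checked numerically. In the following sketch the values for items 4 to 6 are those legible in the printed example; the values for items 1 to 3 are an assumption, chosen only so that the totals reproduce the printed figures of 38, 49, 11, and 83. The function names are likewise introduced here for illustration.

```python
import math

# Paired items. Items 4-6 follow the printed example; items 1-3 are ASSUMED
# values that reproduce the printed totals (38, 49, 11, 83).
a = [8, 8, 11, 5, -1, 7]
b = [8, 8, 12, 6, 8, 7]


def short_method(a, b):
    """Short-method formula: (1/N) * sqrt(sum(d^2) - (sum d)^2 / N)."""
    n = len(a)
    d = [y - x for x, y in zip(a, b)]
    sum_d = sum(d)                       # algebraic sum of the differences
    sum_d2 = sum(v * v for v in d)
    return math.sqrt(sum_d2 - sum_d ** 2 / n) / n


def long_method(a, b):
    """Long-method formula, with standard deviations taken about N."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    r_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n * sd_a * sd_b)
    se_a, se_b = sd_a / math.sqrt(n), sd_b / math.sqrt(n)
    return math.sqrt(se_a ** 2 + se_b ** 2 - 2 * r_ab * se_a * se_b)


se_short = short_method(a, b)
se_long = long_method(a, b)
```

Both functions divide by N rather than N − 1, following the formulas as printed; with these data each gives 1.32 to two decimals.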


REFERENCES 


Ezekiel, Mordecai: “Student’s” Method for
Measuring the Significance of the Differ-
ence between Matched Groups. Journal of
Educational Psychology, Vol. XXIII, Sept.,
1932, pp. 446-450; Vol. XXIV, April,
1933, pp. 306-309.


Wilks, Samuel S.: The Standard Error of 
the Means of “Matched” Samples. Jour- 
nal of Educational Psychology, Vol. XXII, 
March, 1931, pp. 205-208; On the Distri- 
bution of Statistics in Samples from a Nor- 
mal Population of Two Variables with 
Matched Sampling of One Variable. 
Metron, Vol. IX, No. 3-4, 1932. 


“Student”: The Probable Error of a Mean. 
Biometrika, Vol. VI, 1908, pp. 1-25.


Lindquist, E. F.: The Significance of the Dif-
ference Between “Matched” Groups. Jour-
nal of Educational Psychology, Vol. XXIV, 
Jan., 1933, pp. 66-69. 


Fisher, R. A.: Statistical Methods for Re- 


search Workers. 







THE ACCOMPLISHMENT QUOTIENT TECHNIC* 


EDWARD E. CURETON 
Alabama Polytechnic Institute 


I. INTRODUCTION 


The accomplishment quotient has been used 
for many years as a device for estimating pu- 
pil effort and teaching effectiveness. Its value 
for these purposes has often been greatly ex- 
aggerated, and occasionally unduly depre- 
cated. The argument that if we divide a 
child’s educational age by his mental age, we 
obtain a measure of the environmental aspect 
of his school achievement, has often and er- 
roneously been advanced in support of the 
technic. This argument rests on the unten- 
able assumption that a school achievement 
test measures a set of developed abilities, 
while an intelligence test measures the heredi- 
tary aspect of these abilities. 

An examination of the structure and con- 
tent of each of these types of test will reveal 
the error in the assumption. It is obvious 
that every test of intelligence, as well as every 
test of school achievement, is a measure of 
a set of developed abilities. The difference 
lies in the choice of abilities to be measured 
and in the method of devising items to meas- 
ure them. The general intelligence test, as 
its name implies, tries to measure general abil- 
ity. To do this it must include a wide vari- 
ety of mental tasks, including samples of all 
the more important types of mental operation 
and of symbolic content. The achievement 
test, on the contrary, limits its range of sam- 
pling to a relatively narrow and specific set 
of abilities. The symbolic content covered is 
fairly definite, and the range of mental opera- 
tions called for is well defined and not ex- 
tremely extensive. 

The author of an intelligence test devises 
items that he believes everybody, or nearly 
everybody, has had equal opportunities and 
incentives to learn. Insofar as he is success- 
ful, his test permits inferences to be made 
regarding the hereditary capacities which un- 
derlie the measured abilities. His success is 
limited, however, by the fact that, to him, the 
term everybody means almost literally what 

* A portion of this paper was read before the American Educational Research Association at the annual meeting in New Orleans in February, 1937.


it says. He must attempt to devise items that 
educated and uneducated, wealthy and poor, 
urban and rural, professional and laboring 
people have had equal opportunities and in- 
centives to learn. No real approach to com- 
plete success is possible. But by sampling 
a wide variety of backgrounds and by pick- 
ing items with the greatest care, some ap- 
proximate success may be attained. A good 
intelligence test can measure with fair equal- 
ity individuals having rather widely different 
patterns of environmental background, but it 
cannot measure with any fair equality individ- 
uals who differ widely in the general intellec- 
tual level of these backgrounds. The failure 
of attempts to evaluate by means of tests the 
difference in native intelligence between 
American negroes and whites is a striking 
case in point. 

The author of a school achievement test 
has an easier task. Even if his test is to 
measure general achievement, his field is de- 
limited in terms of the common subject-mat- 
ter fields, and his items are confined to types 
which measure the important outcomes of in- 
struction in these fields. Within the limits 
just noted, his task is quite similar to that of 
the intelligence test builder. He too must 
attempt to construct items which everybody 
has had equal opportunities and incentives to 
learn. But for him, everybody means merely 
everybody in the same grade, or everybody 
in the same school, or at worst everybody 
whose general educational background is simi- 
lar in type and in extent. 


When we measure the children in a single 
class or school, we are limiting the range of 
educational backgrounds sharply. We are, to 
be sure, limiting the range of general back- 
grounds indirectly also, but this effect is prob- 
ably not nearly so marked as is the former. 
Within such a group, therefore, variations in 
achievement test score probably reflect dif- 
ferences in heredity at least as much as do 
variations in intelligence test score. The
hereditary elements, however, are not all the 
same in the two cases. A much wider range 




of them presumably operates to affect intelli-
gence test scores, and there are also undoubt-
edly some few elements which affect achieve-
ment test scores but not intelligence test
scores. Hence the accomplishment quotient,
as it is commonly used, reflects a variety of 
factors. The school environment affects the 
educational age and to some extent the men- 
tal age. This last effect is much more impor- 
tant in group tests than in the Binet. The 
general environment affects the mental age 
and to a lesser extent the educational age. 
Many hereditary factors affect both educa- 
tional and mental ages, some affect educa- 
tional age but not mental age, and some af- 
fect mental age but not educational age. 
What the AQ really tells us then, is whether 
the child’s school background, school effort, 
school attitude, and specific hereditary scho- 
lastic intelligence are in general superior or 
equal or inferior to his general background, 
general intellectual attitude and effort, and 
general (exclusive of scholastic) hereditary 
intelligence. 

In practical use, the accomplishment quo- 
tient is beset by a variety of random and sys- 
tematic errors in addition to the errors in in- 
terpretation just noted. The magnitude of 
its standard response error, derived some 
years ago by Huffaker’, is seldom realized by 
the average worker. When we divide an un- 
reliable educational age by an unreliable men- 
tal age, we obtain an accomplishment quo- 
tient which is much more unreliable than 
either. This consideration is chiefly respon- 
sible for Kelley’s recommendation that the 
accomplishment quotient be abandoned alto- 
gether’, but this extreme view, as we shall see, 
is not necessary. 


II. EXPERIMENTAL DATA


In the present investigation, Form V of the 
New Stanford Achievement Test, Advanced 
Examination, Form A of the Otis Group In- 
telligence Scale, Advanced Examination, and 
the Stanford-Binet Intelligence Scale were 
given to each of 83 pupils in Grades 7 and 8 
of the Lee County High School in Auburn, 
Alabama. All mental ages were corrected to 
the date on which the achievement test was 
given by assuming constancy of the IQ over 
the short periods involved. In computing the 
IQ’s as well as the EQ’s of a half-dozen-odd
children who were more than 14 years 4 


* Raised numerals refer to references in Appendix IV. 




months of age, the extrapolated chronological
age scale of the New Stanford Achievement
Test was used. This provides a regular
gradation of “equivalent chronological ages”,
from 14-6 at CA 14-7 to 15-8 at adult CA.
It was felt that this procedure would give a
closer approximation to the true IQ’s of these
older children than would the device of using
actual CA up to 16, and 16-0 thereafter, as
well as giving EQ’s and IQ’s having strictly
comparable meanings at these higher ages.


Three accomplishment quotients were com- 
puted for each pupil. The first of these used 
the Stanford-Binet mental age as the denom- 
inator, and the second used an estimate of 
mental age based on the Otis point-score. 
Otis states, however, that for pupils of a given 
chronological age, the variability of mental 
ages obtained from his test is greater than 
that obtained from the Binet test. He recom- 
mends a special technic based on differences 
for obtaining a valid estimate of the Binet 
IQ from this test. The IQ’s obtained by his 
technic were multiplied by the corresponding 
chronological ages at the achievement test 
date to give so-called Otis Equivalent mental 
ages. The third accomplishment quotient for 
each pupil was computed by using the Otis 
Equivalent mental age as the denominator. 
The AQ so obtained would of course be ex- 
actly equal to the AQ that would result from 
dividing the New Stanford Achievement EQ 
by the Otis IQ determined as Otis recom- 
mends. The equivalent mental ages were re- 
quired in any event for the standard response 
error computations. For convenience in dis- 
tinguishing between these three accomplish- 
ment quotients, we will call the first the Binet 
AQ, the second the Otis AQ, and the third 
the Otis Equivalent AQ. The educational age 
obtained from the New Stanford Achievement 
Test was used as the numerator in all three. 
For the purposes of this study the two grade- 
groups were combined into a single group of 
83 pupils. 
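
The three quotients described above can be sketched in a few lines. This is only an illustration: the pupil's values are hypothetical, the function names are ours, and the multiplication by 100 assumes the usual convention of reporting quotients as percentages.

```python
def accomplishment_quotient(ea_months, ma_months):
    """Ordinary AQ: educational age divided by mental age, times 100."""
    return 100.0 * ea_months / ma_months


def otis_equivalent_ma(otis_iq, ca_months):
    """Otis Equivalent mental age: Otis IQ (a percentage) times chronological age."""
    return otis_iq / 100.0 * ca_months


# Hypothetical pupil (illustrative values only, not taken from the study).
ea = 168.0        # New Stanford educational age, in months
ca = 160.0        # chronological age at the achievement test date, in months
binet_ma = 172.0  # Stanford-Binet mental age, in months
otis_ma = 180.0   # mental age estimated from the Otis point-score, in months
otis_iq = 108.0   # IQ obtained by the technic Otis recommends

binet_aq = accomplishment_quotient(ea, binet_ma)
otis_aq = accomplishment_quotient(ea, otis_ma)
otis_eq_aq = accomplishment_quotient(ea, otis_equivalent_ma(otis_iq, ca))

# The Otis Equivalent AQ equals the EQ divided by the Otis IQ.
eq = 100.0 * ea / ca
assert abs(otis_eq_aq - 100.0 * eq / otis_iq) < 1e-9
```

The closing assertion checks the identity stated in the text: dividing EA by (IQ × CA) gives the same number as dividing the EQ by the IQ.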

In order to obtain the standard response 
errors of the three sets of AQ’s, it was neces- 
sary to find the reliability coefficients of the 
educational and mental ages. This was done 
by scoring the odd and even items separately, 
correlating, and applying the Spearman- 
Brown formula. In the case of the Binet, 


the greatest possible care was used to allocate 
verbal, numerical, and visual items as nearly 
equally as possible to each half-scale at each 




age-level, and to do the same with reference
to judgment, memory, and sensori-motor
items (See Appendix I). The basic data are
given in Table I. The comparatively low
reliability of the Binet mental age is noteworthy.
If this were due to any gross error
in the application of the split-scale technic,
we should expect the correlation between the
Binet and the New Stanford Achievement
Test, when corrected for attenuation by the
use of this coefficient, to be unduly high. The
value .88 is well below the values .94 and .91
for the Otis and Otis Equivalent. There is
therefore excellent internal evidence that the
application of the split-scale technic is justified,
and that the comparatively low reliability
of the Binet mental age is genuine. All
the Binet tests were given by the same examiner,
it should be noted, over a three-month
period just previous to the time when the Otis
and the New Stanford Achievement tests were
given (the latter about a week apart)*.

* I am greatly indebted to my wife, Ruth Duncan Cureton,
not only for giving all these tests, but even more for encouragement
and constructive suggestions throughout the course
of the study.


TABLE I

MEANS, STANDARD DEVIATIONS, RELIABILITY COEFFICIENTS, AND INTERCORRELATIONS OF
EDUCATIONAL AND MENTAL AGES (N = 83)

              M       σ     r₁₁   R₁₁  ||  Stan.  Binet   Otis   Otis Eq.
Stanford    175.5   23.6    .96   .98  ||   --     .88    .94     .91
Binet       174.1   24.4    .81   .89  ||   .82    --     .90     .88
Otis        189.9   26.2    .93   .96  ||   .91    .84    --      .82
Otis Eq.    178.5   17.7    .95   .97  ||   .89    .82    .79     --

The values labelled R₁₁ are Spearman-Brown reliability coefficients. Entries to the right
of the double vertical line are intercorrelations. Those below and to the left of the diagonal
are raw correlations. Those above and to the right of the diagonal are corrected for
attenuation.


In addition to a relatively large random response
error, the ordinary AQ is affected by
systematic errors. Since the correlation between
educational age and mental age is appreciably
less than unity, regression effects enter
in to disturb it. As a result of these effects,
we should expect to find a positive correlation
between EQ and AQ, and a negative
correlation between IQ and AQ, provided the 
mental and educational ages have equal means 
and standard deviations. Part of this regres- 
sion effect is due to errors of measurement in 
the two tests, and part to fundamental differ- 
ences between the two functions measured. 
In order to avoid this difficulty, various writ- 
ers have proposed that we re-define the ac- 
complishment quotient.° Instead of using 
mental age itself as the denominator, we 
should apply a regression equation, and use 
instead the educational age estimated from 
the mental age. This regression-accomplish- 
ment quotient (or RAQ) would then be de- 
fined as educational age divided by estimated 
educational age. 


A still different variation of the accomplish- 
ment quotient is suggested if we wish to elim- 
inate regression effects due to errors of meas- 
urement in the tests, but not those due to 
fundamental differences between test-intelli- 
gence and test-achievement. In this case, we 
may employ regression equations based on the 
reliability coefficients to obtain an estimated 
true AQ (or $AQ_\infty$), defined as estimated true
EA divided by estimated true MA. The 
equations of definition of the three varieties of 
accomplishment quotient are as follows: 





$$AQ = \frac{EA}{MA} \qquad (1)$$

$$RAQ = \frac{EA}{b_{EA,MA}\,MA + (M_{EA} - b_{EA,MA}\,M_{MA})} \qquad (2)$$

$$AQ_\infty = \frac{R_{EA,EA}\,EA + (1 - R_{EA,EA})\,M_{EA}}{R_{MA,MA}\,MA + (1 - R_{MA,MA})\,M_{MA}} \qquad (3)$$
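
As a rough illustration of how definitions (1) to (3) differ in practice, the sketch below evaluates all three for one pupil. The group statistics are the Stanford and Binet figures printed in Table I; the pupil's ages and the function names are hypothetical, and the regression slope is assumed to be the usual least-squares value, r times the ratio of the standard deviations. The functions return plain ratios; multiply by 100 for the conventional quotient.

```python
def ordinary_aq(ea, ma):
    """Formula (1): educational age over mental age."""
    return ea / ma


def regression_aq(ea, ma, b, mean_ea, mean_ma):
    """Formula (2): EA over the EA predicted from MA by the regression line."""
    predicted_ea = b * ma + (mean_ea - b * mean_ma)
    return ea / predicted_ea


def estimated_true_aq(ea, ma, rel_ea, rel_ma, mean_ea, mean_ma):
    """Formula (3): estimated true EA over estimated true MA."""
    true_ea = rel_ea * ea + (1.0 - rel_ea) * mean_ea
    true_ma = rel_ma * ma + (1.0 - rel_ma) * mean_ma
    return true_ea / true_ma


# Group statistics as printed in Table I (Stanford EA and Binet MA).
MEAN_EA, SD_EA, REL_EA = 175.5, 23.6, 0.98
MEAN_MA, SD_MA, REL_MA = 174.1, 24.4, 0.89
R_EA_MA = 0.82  # raw correlation between EA and MA

# Assumption: least-squares slope of EA on MA.
SLOPE = R_EA_MA * SD_EA / SD_MA

# Hypothetical pupil, not taken from the study.
ea, ma = 190.0, 180.0
print(ordinary_aq(ea, ma))
print(regression_aq(ea, ma, SLOPE, MEAN_EA, MEAN_MA))
print(estimated_true_aq(ea, ma, REL_EA, REL_MA, MEAN_EA, MEAN_MA))
```

A pupil whose EA and MA both sit at the group means gets an RAQ of exactly 1, whatever the two means are; the ordinary AQ and the estimated true AQ for that pupil both equal the ratio of the means.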




The symbol R, in the last formula, denotes 
a reliability coefficient for a total test, ob- 
tained usually by the use of the Spearman— 
Brown formula. 
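
The two computations behind Table I, the Spearman-Brown step-up and the correction for attenuation, can be sketched as follows. We assume that the column headed r₁₁ holds the correlations between half-tests; the function names are ours. Because the printed inputs are themselves rounded, the results should agree with the table only to about the second decimal.

```python
import math


def spearman_brown(r_half):
    """Reliability of the whole test from the correlation between its halves."""
    return 2.0 * r_half / (1.0 + r_half)


def correct_for_attenuation(r_xy, rel_x, rel_y):
    """Correlation between two measures, corrected for their unreliability."""
    return r_xy / math.sqrt(rel_x * rel_y)


# Half-test correlations as printed in Table I (assumed to be column r11).
half_r = {"Stanford": 0.96, "Binet": 0.81, "Otis": 0.93, "Otis Eq.": 0.95}
full_r = {name: spearman_brown(r) for name, r in half_r.items()}

# Raw Stanford-Binet correlation (.82), corrected with the two reliabilities.
stanford_binet = correct_for_attenuation(0.82, full_r["Stanford"], full_r["Binet"])

# Raw Stanford-Otis correlation (.91), corrected the same way.
stanford_otis = correct_for_attenuation(0.91, full_r["Stanford"], full_r["Otis"])
```

The Stanford-Binet value comes out lower than the Stanford-Otis value, which is the comparison the text relies on when arguing that the low Binet reliability is genuine.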

The standard response error of the ordinary 
accomplishment quotient (1) has been de- 
rived, as was earlier noted, by Huffaker’. A 
new derivation is presented in Appendix II, 
together with derivations of the standard re- 
sponse errors of the regression-accomplish- 
ment quotient (2) and the estimated true ac- 
complishment quotient (3). 

The expected correlations (positive for EQ 
and AQ, and negative for IQ and AQ), together
with the distribution of the AQ, are
still further disturbed if the means and stand- 
ard deviations of mental and educational ages 
are not equal*. By referring to Table I it 
may be seen that the Binet mental ages are
only a little more variable than the New Stan-
ford educational ages, the Otis mental ages are
more variable, and the Otis Equiva-
lent mental ages are very much less variable.
The mean Binet MA is only slightly below the 
mean EA, while the mean Otis Equivalent MA 
is appreciably higher and the mean Otis MA 
is much higher. The Binet is best on both 
counts. The Otis Equivalent is next best 
with reference to the mean, and worst with 
reference to the standard deviation. The 
New Stanford EA is comparable to the Binet 
MA (means and standard deviations equal 
within their respective sampling errors), but 
not to the Otis MA nor to the Otis Equivalent 
MA. It is worth noting in this connection 
that the lowest of all the intercorrelations is 
that between the Otis and the Otis Equivalent 
mental ages. 


* This matter is taken up in some detail in Appendix III. 




III. RESULTS AND DISCUSSION 


Table II gives the fundamental data re- 
garding the nine varieties of accomplishment 
quotient. The first two columns give the 
means and standard deviations of the actual 
distributions of AQ’s as computed. Note that 
the means of the three based on the Binet are 
all within one point of 100, while in the case 
of the Otis and Otis Equivalent, only the 
mean RAQ’s are this close. The mean RAQ 
is always 100 within the grouping and round- 
ing-off errors. The standard deviations of 
the three based on the Binet are the largest, 
due no doubt to the lower reliability of the 
Binet mental ages. The AQ’s based on the 
Otis Equivalent MA’s are more variable than 
those based on the Otis MA’s, probably be- 
cause the Otis Equivalent AQ’s are not com- 
parable in meaning to other AQ’s, as will 
shortly be apparent. 

The third column of Table II gives the 
standard response errors of the various AQ’s. 
The standard response error of the accom- 
plishment quotient is an estimate of the 
standard deviation of the distribution of AQ’s 
which we should expect to obtain if we gave a 
large number of pairs of comparable forms of 
intelligence tests and achievement tests to one 
individual. It is a special case of the stand- 
ard error of an individual score, the score 
here being the AQ. Strictly speaking, the 
standard response error here reported is in 
each case the standard response error of an 
individual AQ equal in magnitude to the 
mean AQ. It is to be interpreted with ref- 
erence to the standard deviation of the dis- 
tribution of actual AQ’s. Each standard re- 
sponse error is therefore expressed in standard 
deviation units in column 4. All these relative
standard response errors except those involving
the Otis Equivalent AQ’s have numerical
values between .52 and .58. Since
the derivations of the standard response error
formulas assume bona fide mental ages, it is
doubtful if the lower values for the Otis
Equivalent indicate anything important other
than the inapplicability of the formulas to
these equivalent mental ages and accomplishment
quotients.


TABLE II

MEANS, STANDARD DEVIATIONS, STANDARD RESPONSE ERRORS, RELATIVE STANDARD RESPONSE
ERRORS, AND CORRELATIONS WITH IQ AND EQ OF VARIOUS ACCOMPLISHMENT QUOTIENTS,
TOGETHER WITH THE NUMBER OF SIGNIFICANTLY POSITIVE AND NEGATIVE AQ’s
IN A SET OF 83

                   M      σ      ε     ε/σ   r(AQ,IQ)  r(AQ,EQ)  +2.5%  −2.5%
Binet AQ         101.4   8.61   5.00   .58     −.54      −.02       5     11
Binet RAQ         99.1   7.83   4.10   .52     −.08       .20      12     10
Binet AQ∞        100.9   7.95   4.32   .54     −.62       .32       6     10
Otis AQ           92.8   5.39   3.02   .56     −.47      −.56      14      8
Otis RAQ          99.4   5.29   3.04   .57     −.12       .22      13     10
Otis AQ∞          92.9   5.18   2.88   .56     −.38      −.56      14     10
Otis Eq. AQ       98.1   6.49   2.48   .38      .59       .82      20     16
Otis Eq. RAQ      99.6   6.15   2.75   .45      .40       .68      20     18
Otis Eq. AQ∞      97.9   6.58   2.39   .36      .60       .83      19     16

If we assume that the response errors in 
AQ’s near the mean are distributed normally, 
we can set approximate significance limits. 
Adopting the rather severe 5% criterion (2.5% 
at the top and 2.5% at the bottom), we find by 
referring to a table of the normal probability 
integral that the corresponding $x/\sigma$ points are
at $\pm 1.96$. Multiplying 1.96 by $\varepsilon$, and adding
and subtracting from the mean AQ, we obtain
the upper and lower limits of significance.
Columns 7 and 8 give the number of pupils
(out of 83) above and below these limits. It
should be noted that if we multiply the medi-
an of the six $\varepsilon/\sigma$ values (exclusive of those
involving the Otis Equivalent), .56, by 1.96,
we obtain the value 1.0976, which is just 
greater than unity. For all practical pur- 
poses, therefore, in a group such as ours, we 
may consider any AQ more than one standard 
deviation above the mean as significantly 
positive, and any AQ more than one standard 
deviation below the mean as significantly 
negative. 

As a check on the normality of the distribution
of AQ’s we may note that the expected
number of cases out of 83, beyond $x/\sigma = 1.0976$,
is 11.3. This value is to be compared
with the values in columns 7 and 8 of Table II.
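
The significance limits and the expected count of 11.3 can be checked with a short sketch. It assumes, as the text does, that response errors near the mean are normally distributed; the function names are ours, and the example row is the first row of Table II (mean 101.4, standard response error 5.00).

```python
import math


def normal_cdf(z):
    """Cumulative probability of the unit normal distribution up to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def significance_limits(mean_aq, response_error, z=1.96):
    """Lower and upper limits lying z response errors from the mean AQ."""
    half_width = z * response_error
    return mean_aq - half_width, mean_aq + half_width


def expected_beyond(z, n):
    """Expected number of cases, out of n, in ONE tail beyond z."""
    return n * (1.0 - normal_cdf(z))


# First row of Table II: mean AQ 101.4, standard response error 5.00.
lower, upper = significance_limits(101.4, 5.00)

# Median relative response error (.56) times 1.96, in standard deviation units.
cutoff = 1.96 * 0.56
expected = expected_beyond(cutoff, 83)
print(round(lower, 1), round(upper, 1), round(expected, 1))
```

The expected count is a one-tail figure, which is why it is compared separately with column 7 and with column 8.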


Column 5 gives the correlations between 
AQ and IQ, and column 6 the correlations be- 
tween AQ and EQ. It has already been 
noted that the former should be negative and 
the latter positive, if the mean and standard 
deviation of the EA’s are equal respectively 
to the mean and standard deviation of the 
MA’s. In the case of the RAQ, its correla- 
tion with IQ should be close to zero.* The 
extent of the disturbances due to unequal 
means and standard deviations can be judged 
from the irregular values of these correlations. 
The Binet once again is best, as was to be ex-
pected on the basis of the closer correspond-
ence of its mean and standard deviation with

* See Appendix III.


those of the New Stanford Achievement Test. 
Since the AQ’s based on the Otis Equivalent 
MA all show substantial positive correlations 
with IQ, as well as still higher correlations 
with EQ, it is apparent that they do not have 
the properties expected and desired in the AQ. 
When using the Otis test for the purpose of 
computing AQ’s, therefore, the AQ should be 
defined as EA/MA and the IQ as MA/CA, 
as is done with other tests. 


If the AQ is to be interpreted as the ratio 
of school background, effort, attitude, and in- 
telligence to general background, effort, atti- 
tude, and intelligence, it would seem that for 
maximum clarity in such interpretations the 
correlations between AQ and IQ and between 
AQ and EQ should be as low as possible (par- 
ticularly the AQ, IQ correlation). If we accept 
this criterion, the regression-accomplishment 
quotient (RAQ) is definitely superior to both 
the ordinary AQ and the estimated true AQ 
($AQ_\infty$). The estimated true AQ, furthermore,
is not demonstrably superior to the ordinary
AQ. This is of course the condition
we should expect if both the intelligence test 
and the achievement test were highly reliable. 
With the tests used in the present study it 
shows only that the factor of unreliability is 
not large in comparison with the factor of 
basic differences between test-intelligence and 
test-achievement as a cause of variability of 
the AQ,—a hopeful and noteworthy finding 
which can be checked by comparing the raw 
and corrected EA,MA intercorrelations in 
Table I. The RAQ possesses the additional 
advantage that it means most nearly what the 
non-technical worker believes any AQ signi- 
fies—namely the ratio of test-achievement to 
the achievement predicted by an intelligence 
test. 


A major obstacle to the widespread use of 
the RAQ is the fact that it is difficult to com- 
pute. The average teacher is not prepared to 
set up and solve a regression equation. The 
only solution that suggests itself is for the 
authors of tests to prepare special tables. In 
the New Stanford—Binet Scale, just recently 
made available, we have a logical starting 
point. The regression of some appropriate 
(comparable) achievement test upon the new 
Binet could be determined for several large 
groups, and the results smoothed. As long 


as comparability exists, there is a direct relation
between the standard deviation (the average,
in the case of a single class, of $\sigma_{EA}$ and
$\sigma_{MA}$) and the regression. A set of tables, or
perhaps better nomographs, could then be prepared,
to be entered with EA, MA, and $\sigma$,
from which the RAQ could be read directly.
In the opinion of the writer, the value of such 
tables or nomographs would easily repay the 
effort involved in their construction. The 
regression-accomplishment quotient, particu- 
larly if based on the new and apparently much 
more reliable Revised Stanford—Binet Scale, 
will undoubtedly prove to be a useful measure, 
in spite of its rather large response error. It 
at least makes possible the identification with 
reasonable certainty of extreme cases, and 
from the practical standpoint it is precisely 
these extreme cases that really matter. 


IV. THE EFFECTS OF SCHOOLING
UPON VARIABILITY


A question of recurrent interest relates to 
the claim that the American public school 
“pushes” or over-stimulates dull children, 
while neglecting or under-stimulating bright 
ones. Kelley has reported evidence* tending 
to show that children in higher grades are 
basically more alike in school achievement 
than are children in lower grades. It has 
often been assumed that the accomplishment 
quotient might throw some further light on 
this problem, since the condition if it exists 
must operate to produce a true negative cor- 
relation between IQ and AQ. It has so far 
proved impossible, however, to untangle any 
such correlation from the regression effects 
and the effects of unequal means and standard 
deviations. Another approach is possible in 
terms of mental and educational ages, how- 
ever. Since both of these variables are ex- 
pressed in similar and presumably equivalent 
growth-units, the assumed condition should 
result in lowering the true variability of edu- 
cational ages as compared with mental ages 
in a group of children all of the same chrono- 
logical age. It will be necessary to compare 
estimated true standard deviations or vari- 
ances rather than raw standard deviations or 
variances of educational and mental ages, in 
order to compensate for disturbances due to 
unequal reliabilities of the educational and 
mental tests. The estimated true variance 
of either form of a test (or of a half-test) is 
equal to the covariance of the two comparable 
forms (or half-tests). A derivation of the 
standard error of the ratio of two such covari- 
ances is presented as a part of Appendix II. 




This ratio should be significantly less than 
unity if the hypothesis holds. 
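
The comparison proposed above can be sketched as follows: the covariance between two comparable forms (or half-tests) serves as the estimated true variance, and the ratio of the EA covariance to the MA covariance is the quantity of interest. The scores below are invented purely to make the sketch runnable, the function names are ours, and the standard error of the ratio is not attempted here.

```python
def covariance(xs, ys):
    """Population covariance of two equally long lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def true_variance_ratio(ea_form_a, ea_form_b, ma_form_a, ma_form_b):
    """Estimated true variance of EA divided by that of MA."""
    return covariance(ea_form_a, ea_form_b) / covariance(ma_form_a, ma_form_b)


# Hypothetical scores for four pupils on two comparable forms of each test.
ea_a = [10.0, 12.0, 14.0, 16.0]
ea_b = [11.0, 12.0, 15.0, 16.0]
ma_a = [9.0, 12.0, 15.0, 18.0]
ma_b = [10.0, 13.0, 14.0, 19.0]

ratio = true_variance_ratio(ea_a, ea_b, ma_a, ma_b)
print(ratio)
```

A ratio below unity in such a comparison would point toward educational ages being less variable than mental ages, subject to the sampling error of the ratio.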


We do not here have a group of children of 
anywhere near the same CA, and selective 
factors in grade-placement should presumably 
operate to make children in the same grades 
more alike in test-achievement than in test-intelligence.
In our group of 83, the value of
the ratio $\sigma^2_{EA_\infty}/\sigma^2_{MA_\infty}$ is 1.06 for the Binet
mental ages, with a standard error of .14.
For the Otis mental ages the ratio is .84 and
the standard error is .08. No computation
was made for the Otis Equivalent mental ages, 
since these are not growth-units at all. It is 
doubtful if the above results have any signifi- 
cance other than to show the necessity for 
using larger groups. 


V. CONCLUSIONS 


1. The accomplishment quotient can not 
be interpreted as a direct measure of school 
drive or effort, or of teaching efficiency. In 
a single class or school, the achievement test 
is as much affected by heredity as is the in- 
telligence test, and the intelligence test is as 
much affected by environment as is the 
achievement test. The heredity and the en-
vironment involved are different in the two
tests, however. Therefore the AQ actually ex- 
presses more or less crudely the ratio of school 
effort, attitude, background, and native scho- 
lastic capacity to general effort, attitude, back- 
ground, and native general intelligence. A 
knowledge of this ratio should be an important 
item of information regarding any child in 
whom it is especially large or small. 


2. An AQ based on the Binet test should 
be much more meaningful than one based on 
a group test, since the Binet succeeds much 
better in measuring the general (as contrasted 
with the scholastic) background, attitude, ef- 
fort, and native capacity. 


3. For maximum usefulness and meaning- 
fulness, the AQ should have a zero or near- 
zero correlation with IQ, and as low a corre- 
lation with EQ, in addition, as is possible. 
The regression-accomplishment quotient 
(RAQ), defined as EA divided by estimated 
EA, comes much the closest to meeting both 
of these requirements. It can only be useful 
in practice, however, if the authors of tests 
prepare special tables or nomographs to as- 
sist the teacher or non-technical worker in his 
computations. 










4. The AQ (or the RAQ) possesses its us- 
ual meaning only if the means and standard 
deviations of the mental and educational ages 
are equal or approximately equal. 

5. The standard response error of the AQ 
(or of the RAQ) is small enough so that in a 
two-grade range (which in this case is not 
much greater than a one-grade range) an AQ 
(or RAQ) more than one standard deviation 
above or below the mean may be considered 
significantly positive or negative. Identifi- 
cation of appreciable numbers of extreme 
cases is therefore possible. With more reli- 
able tests (such as the Revised Stanford— 
Binet), still better results should be obtained. 

6. The problem of the effects of schooling 
upon variability cannot be solved by correlating
IQ and AQ. If an achievement test and
an intelligence test are scaled in comparable 
growth-units, a solution is suggested in the 
comparison of estimated true variances of 
educational and mental ages in a group of 
children all of the same chronological age. 
If the hypothesis that schooling makes chil- 
dren more alike is true, the educational ages 
should be significantly less variable than the 
mental ages. 

7. (Added Note) If we define the RAQ
not as EA divided by estimated EA, but in- 
stead as EQ divided by estimated EQ, the 
requirement of 4, above, is no longer neces- 
sary. This, therefore, should be the defini- 
tion of the RAQ finally to be recommended. 
(See Appendix III). 








APPENDIX I

DIVISION OF THE STANFORD-BINET SCALE INTO EQUIVALENT HALF-SCALES

Year        Credit   “Odd Scale”                “Even Scale”
Basal age            Same as total              Same as total
VII           4      2  Pictures                Differences
              4      3  Repeats 5 digits        Fingers
              4      6  Copies diamond          Ties bow knot
VIII          4      6  Vocabulary              Ball and field
              4      2  Counts backwards        Comprehension
              4      5  Definitions             Similarities
IX            4      2  Weights                 Date
              4      3  Makes change            Repeats backwards
              4      5  Three words             Rhymes
X             4      1  Vocabulary              Absurdities
              4      3  Designs                 Reading and report
              4      6  Sixty words             Comprehension
XII           6      1  Vocabulary              Abstract words
              6      4  Dissected sentences     Ball and field
              6      5  Fables                  Repeats 5 backwards
              6      7  Pictures: interp.       Similarities
XIV           8      1  Vocabulary              Induction
              8      4  Problems of fact        President and king
              8      6  Reverse clock           Arithmetic reasoning
XVI          10      1  Vocabulary              Differences
             10      2  Fables                  Repeats 6 backwards
             10      3  Enclosed boxes          Code
XVIII        12      1  Vocabulary              Repeats thought
             12      2  Paper cutting           Repeats 7 backwards
             12      3  Repeats 8 forward       Ingenuity


GENERAL PRINCIPLES OF DIVISION (in the order of precedence) 


1. A test once allocated to one half-scale must always be allocated to that half-scale. 
2. Vocabulary to the “Odd Scale”, and the next-best verbal test at each age to the “Even Scale”. 


3. Equal division of tests of known high validity and of doubtful validity as measures of general 
intelligence. 


4. Equal division of verbal, numerical, and visual tests. 
5. Equal division of reasoning, memory, and sensori-motor tests. 







APPENDIX II 
STANDARD ERROR DERIVATIONS 


Let $e$ be an individual EA,
    $m$ an individual MA,
and $a$ an individual AQ,
then

$$a = e/m \qquad (4)$$

Let $\sigma_e$ be the standard response error of the EA,
    $\sigma_m$ the standard response error of the MA,
    $\sigma_{em}$ the covariance of response errors in EA and MA, assumed equal to zero if tests are given on different days, and
    $\sigma_a$ the standard response error of the AQ, to be derived.

Let $\sigma_E$ be the standard deviation of the EA’s,
    $\sigma_M$ the standard deviation of the MA’s,
    $R_{EE}$ the Spearman–Brown reliability of the EA’s,
    $R_{MM}$ the Spearman–Brown reliability of the MA’s,
and $r_{EM}$ the correlation between EA’s and MA’s.


Then it is well known® that the standard re- 
sponse error of a single score, taken as evi- 
dence of a true score, may be written (refer- 
ring to EA’s or MA’s), 


$$\sigma_e^2 = \sigma_E^2\,(1 - R_{EE}) \qquad (5)$$

$$\sigma_m^2 = \sigma_M^2\,(1 - R_{MM}) \qquad (6)$$


The reliability coefficients in these formulas 
will always be computed by the Spearman— 
Brown formula, whether the original correla- 
tions are between split-halves of a single test 
or comparable forms. In either case the 
whole test (or the sum of the scores on 
the two forms) will be used in obtaining 
means, standard deviations, and intercorrela- 
tions, as well as in computing individual ac- 
complishment quotients and their constituent 
educational and mental ages. Any other pro- 
cedure would be tantamount to throwing away 
half of the available data. 


Taking the logarithmic differentials of (4),

$$\frac{da}{a} = \frac{de}{e} - \frac{dm}{m}.$$


Squaring, summing over the theoretically in- 
finite population of successive forms of the 
tests, and dividing by this theoretically infinite 
number of forms, 

$$\frac{\sigma_a^2}{a^2} = \frac{\sigma_e^2}{e^2} + \frac{\sigma_m^2}{m^2} - \frac{2\sigma_{em}}{em}.$$





Remembering that $\sigma_{em}$ is assumed equal to zero, and substituting from (5) and (6), we have after reducing,

$$\sigma_a = a\left[\frac{\sigma_E^2\,(1 - R_{EE})}{e^2} + \frac{\sigma_M^2\,(1 - R_{MM})}{m^2}\right]^{1/2}. \tag{7}$$








Standard response error of an individual AQ. 


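If we replace $a$ by $M_E/M_M$, $e^2$ by $M_E^2$, and $m^2$ by $M_M^2$, to adjust the formula to the case of an AQ equal in numerical magnitude to the average AQ, we then have,

$$\sigma_a = \frac{M_E}{M_M}\left[\frac{\sigma_E^2\,(1 - R_{EE})}{M_E^2} + \frac{\sigma_M^2\,(1 - R_{MM})}{M_M^2}\right]^{1/2}. \tag{8}$$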





Standard response error of an AQ at the average. 


If the intelligence test and the achievement test are comparable, so that $\sigma_E = \sigma_M = \sigma$, and $M_E = M_M = M$, this formula reduces still further to,

$$\sigma_a = \frac{\sigma}{M}\,(2 - R_{EE} - R_{MM})^{1/2}. \tag{9}$$


Standard response error of an AQ at the average 
for the case of comparable tests. 
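
Formulas (7) to (9) can be checked against one another numerically. The following is a minimal sketch, not the author's own computing routine; the function names and the illustrative values (means of 120 months, standard deviations of 15 months, reliabilities of .90 and .92) are assumptions introduced here for illustration.

```python
import math

def aq_response_error(e, m, sd_e, sd_m, rel_e, rel_m):
    """Formula (7): standard response error of an individual AQ = e/m.

    sd_e, sd_m are the group standard deviations of EA and MA;
    rel_e, rel_m are the Spearman-Brown reliabilities.
    The response-error covariance is taken as zero, as in the text.
    """
    a = e / m
    var_e = sd_e ** 2 * (1 - rel_e)  # formula (5)
    var_m = sd_m ** 2 * (1 - rel_m)  # formula (6)
    return a * math.sqrt(var_e / e ** 2 + var_m / m ** 2)

def aq_response_error_at_average(mean_e, mean_m, sd_e, sd_m, rel_e, rel_m):
    """Formula (8): formula (7) evaluated at e = M_E and m = M_M."""
    return aq_response_error(mean_e, mean_m, sd_e, sd_m, rel_e, rel_m)

def aq_response_error_comparable(sd, mean, rel_e, rel_m):
    """Formula (9): comparable tests, with common sd and common mean."""
    return (sd / mean) * math.sqrt(2 - rel_e - rel_m)

# Hypothetical values, chosen only to exercise the formulas.
general = aq_response_error_at_average(120.0, 120.0, 15.0, 15.0, 0.90, 0.92)
special = aq_response_error_comparable(15.0, 120.0, 0.90, 0.92)
print(round(general, 4), round(special, 4))
```

With equal means and equal standard deviations the two printed values agree, which is the reduction from (8) to (9) described above.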





These last two formulas were derived first by
Huffaker³, by a method which did not lead
first to (7).

The regression-accomplishment quotient
may be written,

$$a = \frac{e}{bm + c}, \tag{10}$$


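where $b = b_{EM}$, and $c = M_E - bM_M$. Taking logarithmic differentials, squaring, summing over all forms, and dividing by the number of forms,

$$\frac{\sigma_a^2}{a^2} = \frac{\sigma_e^2}{e^2} + \frac{b^2\sigma_m^2}{(bm + c)^2} - \frac{2b\,\sigma_{em}}{e\,(bm + c)}.$$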







Noting again that $\sigma_{em}$ is taken equal to zero, and substituting from (5) and (6) and reducing,

$$\sigma_a = a\left[\frac{\sigma_E^2\,(1 - R_{EE})}{e^2} + \frac{b^2\sigma_M^2\,(1 - R_{MM})}{(bm + c)^2}\right]^{1/2}. \tag{11}$$


Standard response error of an individual regres- 
sion-accomplishment quotient. 


Replacing $e^2$ by $M_E^2$, $m$ by $M_M$, and $c$ by $(M_E - bM_M)$, we find that $(bm + c)$ reduces simply to $M_E$, and $a = M_E/(bM_M + c) = 1$. Also $b^2 = r_{EM}^2\sigma_E^2/\sigma_M^2$, and substituting in (11) and simplifying,

$$\sigma_a = \frac{\sigma_E}{M_E}\left[1 - R_{EE} + r_{EM}^2\,(1 - R_{MM})\right]^{1/2}. \tag{12}$$


Standard response error of a regression-accom- 
plishment quotient at the average. 


The assumption of comparable tests, i.e. the
assumption that $M_E = M_M$ and $\sigma_E = \sigma_M$, does
not result in any useful simplification of this
formula.
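
The step from (11) to (12) can be verified by evaluating (11) at the group means. The sketch below assumes the reading of (11) and (12) in which the two terms under the radical are added; the function names and the sample values are hypothetical and serve only as an illustration.

```python
import math

def raq_response_error(e, m, mean_e, mean_m, sd_e, sd_m, rel_e, rel_m, r_em):
    """Formula (11): standard response error of an individual RAQ = e/(bm + c)."""
    b = r_em * sd_e / sd_m           # regression of EA on MA
    c = mean_e - b * mean_m          # intercept, as defined after (10)
    predicted = b * m + c            # estimated EA for this MA
    a = e / predicted
    var_e = sd_e ** 2 * (1 - rel_e)  # formula (5)
    var_m = sd_m ** 2 * (1 - rel_m)  # formula (6)
    return a * math.sqrt(var_e / e ** 2 + b ** 2 * var_m / predicted ** 2)

def raq_response_error_at_average(mean_e, sd_e, rel_e, rel_m, r_em):
    """Formula (12): formula (11) with e = M_E and m = M_M, so that a = 1."""
    return (sd_e / mean_e) * math.sqrt(1 - rel_e + r_em ** 2 * (1 - rel_m))

# Hypothetical values: means in months, reliabilities, and r_EM.
at_mean = raq_response_error(126.0, 120.0, 126.0, 120.0, 16.0, 14.0, 0.90, 0.92, 0.75)
closed = raq_response_error_at_average(126.0, 16.0, 0.90, 0.92, 0.75)
print(round(at_mean, 4), round(closed, 4))
```

The agreement of the two values does not depend on the tests being comparable, since the means and standard deviations of the mental test cancel out of (12).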


The estimated true accomplishment quotient may be written,

$$a = e_\infty / m_\infty. \tag{13}$$

Here $e_\infty$ is an estimated true EA and $m_\infty$ is an estimated true MA. The regression equation for an estimated true EA is,

$$e_\infty = b_{E_\infty E}\,e + M_{E_\infty} - b_{E_\infty E}\,M_E = b_e e + c_e.$$

Now, $b_{E_\infty E} = r_{E_\infty E}\,\sigma_{E_\infty}/\sigma_E$; and since it is generally known that $r^2_{E_\infty E} = R_{EE}$, and $\sigma^2_{E_\infty} = R_{EE}\sigma_E^2$, we find that on substituting and reducing,

$$b_{E_\infty E} = b_e = R_{EE},$$

and noting that the best estimate of $M_{E_\infty}$ is simply $M_E$,

$$c_e = M_E\,(1 - R_{EE}).$$




A precisely similar line of reasoning holds with respect to $m_\infty$, so that substituting in (13) we obtain,

$$a = \frac{R_{EE}\,e + c_e}{R_{MM}\,m + c_m} = \frac{R_{EE}\,e + M_E\,(1 - R_{EE})}{R_{MM}\,m + M_M\,(1 - R_{MM})}. \tag{14}$$


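Substituting for $\sigma_E^2$ and $\sigma_M^2$ in (5) and (6) (that is, replacing them by the estimated true variances $R_{EE}\sigma_E^2$ and $R_{MM}\sigma_M^2$), we obtain,

$$\sigma_e^2 = \sigma_E^2\,(R_{EE} - R_{EE}^2). \tag{15}$$
$$\sigma_m^2 = \sigma_M^2\,(R_{MM} - R_{MM}^2). \tag{16}$$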
Taking the logarithmic differentials of (14), squaring, summing over all forms, and dividing by the number of forms,

$$\frac{\sigma_a^2}{a^2} = \frac{R_{EE}^2\sigma_e^2}{(R_{EE}\,e + c_e)^2} + \frac{R_{MM}^2\sigma_m^2}{(R_{MM}\,m + c_m)^2} - \frac{2R_{EE}R_{MM}\,\sigma_{em}}{(R_{EE}\,e + c_e)(R_{MM}\,m + c_m)}.$$


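Since $\sigma_{em} = 0$ by assumption, the last term drops out, and then substituting from (15) and (16) we obtain,

$$\sigma_a = a\left[\frac{\sigma_E^2 R_{EE}^3\,(1 - R_{EE})}{\{R_{EE}\,e + M_E\,(1 - R_{EE})\}^2} + \frac{\sigma_M^2 R_{MM}^3\,(1 - R_{MM})}{\{R_{MM}\,m + M_M\,(1 - R_{MM})\}^2}\right]^{1/2}. \tag{17}$$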


Standard response error of an individual esti- 
mated true AQ. 


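Replace $e$ by $M_E$, $m$ by $M_M$, and $a$ by

$$\frac{R_{EE}M_E + M_E\,(1 - R_{EE})}{R_{MM}M_M + M_M\,(1 - R_{MM})} = \frac{M_E}{M_M}.$$

Then,

$$\sigma_a = \frac{M_E}{M_M}\left[\frac{\sigma_E^2\,(R_{EE}^3 - R_{EE}^4)}{M_E^2} + \frac{\sigma_M^2\,(R_{MM}^3 - R_{MM}^4)}{M_M^2}\right]^{1/2}. \tag{18}$$

Standard response error of an estimated true
AQ at the average.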


If tests are comparable, so that $M_E = M_M = M$ and $\sigma_E = \sigma_M = \sigma$,

$$\sigma_a = \frac{\sigma}{M}\,(R_{EE}^3 - R_{EE}^4 + R_{MM}^3 - R_{MM}^4)^{1/2}. \tag{19}$$


Standard response error of an estimated true 
AQ at the average for the case of comparable 
tests. 
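
The chain from (14) through (19) can be traced numerically in the same way. The sketch below follows the formulas as read from a damaged original, in particular the cubed reliabilities that arise from combining (15) and (16) with the squared regression weights; the function names and the sample figures are assumptions made for illustration.

```python
import math

def estimated_true_aq(e, m, mean_e, mean_m, rel_e, rel_m):
    """Formula (14): ratio of the estimated true EA to the estimated true MA."""
    true_e = rel_e * e + mean_e * (1 - rel_e)
    true_m = rel_m * m + mean_m * (1 - rel_m)
    return true_e / true_m

def estimated_true_aq_error(e, m, mean_e, mean_m, sd_e, sd_m, rel_e, rel_m):
    """Formula (17): standard response error of an individual estimated true AQ."""
    true_e = rel_e * e + mean_e * (1 - rel_e)
    true_m = rel_m * m + mean_m * (1 - rel_m)
    term_e = sd_e ** 2 * rel_e ** 3 * (1 - rel_e) / true_e ** 2
    term_m = sd_m ** 2 * rel_m ** 3 * (1 - rel_m) / true_m ** 2
    return (true_e / true_m) * math.sqrt(term_e + term_m)

def estimated_true_aq_error_comparable(sd, mean, rel_e, rel_m):
    """Formula (19): the error at the average for comparable tests."""
    return (sd / mean) * math.sqrt(
        rel_e ** 3 - rel_e ** 4 + rel_m ** 3 - rel_m ** 4
    )

# Hypothetical pupil with EA 132 and MA 114 months in a group averaging 120.
print(round(estimated_true_aq(132.0, 114.0, 120.0, 120.0, 0.90, 0.92), 4))
print(round(estimated_true_aq_error(120.0, 120.0, 120.0, 120.0, 15.0, 15.0, 0.90, 0.92), 4))
print(round(estimated_true_aq_error_comparable(15.0, 120.0, 0.90, 0.92), 4))
```

The estimated true AQ of the hypothetical pupil lies nearer unity than the raw quotient 132/114, since both ages are regressed toward their means before division.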


The function used to test the hypothesis
that schooling reduces variability is,

$$\frac{\sigma^2_{E_\infty}}{\sigma^2_{M_\infty}} = \frac{\sigma_{EE}}{\sigma_{MM}}. \tag{20}$$

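Taking logarithmic differentials, squaring, summing over all samples, and dividing by the number of samples,

$$\frac{\sigma^2_{(\sigma_{EE}/\sigma_{MM})}}{(\sigma_{EE}/\sigma_{MM})^2} = \frac{\sigma^2_{\sigma_{EE}}}{\sigma_{EE}^2} + \frac{\sigma^2_{\sigma_{MM}}}{\sigma_{MM}^2} - \frac{2\,\sigma_{\sigma_{EE}\sigma_{MM}}}{\sigma_{EE}\,\sigma_{MM}}.$$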
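Now it is known⁸ that,

$$N\sigma^2_{\sigma_{12}} = \sigma_1^2\sigma_2^2 + \sigma_{12}^2, \quad \text{and} \quad N\sigma_{\sigma_{12}\sigma_{34}} = \sigma_{13}\sigma_{24} + \sigma_{14}\sigma_{23}.$$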


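Substituting in the equation above and reducing,

$$\sigma_{(\sigma_{EE}/\sigma_{MM})} = \frac{\sigma_{EE}}{\sigma_{MM}} \cdot \frac{1}{\sqrt{N}}\left[\frac{1}{r_{EE}^2} + \frac{1}{r_{MM}^2} + 2 - \frac{2\,(r_{EM}\,r_{E'M'} + r_{EM'}\,r_{E'M})}{r_{EE}\,r_{MM}}\right]^{1/2}. \tag{21}$$

Standard error of the ratio of two estimated true
variances, or standard error of the ratio of two
covariances.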





It is to be noted that this is not a standard 
response error, but a sampling error formula. 
All correlations dealt with are correlations be- 
tween half-tests, or between single pairs of 
equivalent forms. The primes in the last 
term denote the “even-scale” or Form—B edu- 
cational and mental ages. Throughout the 
rest of the formula, as in previous ones, it has 
been unnecessary to distinguish between the 
“odd-scale” and the “even-scale”. If the two 
forms of the educational achievement test are 
equivalent, and if the two forms of the mental 
test are also equivalent, then we may assume 
that $r_{EM} = r_{E'M'} = r_{EM'} = r_{E'M}$. In this case,
it is not necessary to compute all four of these
intercorrelations. We may instead compute
only the one, $r_{(E+E')(M+M')}$, which is equivalent
to the $r_{EM}$ of previous notation, and will
be so designated. Then,







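$$\sigma_{(\sigma_{EE}/\sigma_{MM})} = \frac{\sigma_{EE}}{\sigma_{MM}} \cdot \frac{1}{\sqrt{N}}\left[\frac{1}{r_{EE}^2} + \frac{1}{r_{MM}^2} + 2 - \frac{r_{EM}^2\,(1 + r_{EE})(1 + r_{MM})}{r_{EE}\,r_{MM}}\right]^{1/2}. \tag{22}$$

Standard error of the ratio of two estimated true
variances, or of two covariances, when the two
forms of each test are equivalent.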
The notation of this formula coincides exactly 
with that defined at the beginning of this 
Appendix. 
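
The reduction of (21) to (22) rests on expressing the common half-test intercorrelation in terms of the full-test $r_{EM}$. A minimal sketch of that check follows. It assumes, beyond what the text states, that the two halves of each test have equal variances, so that the full-test correlation equals twice the half-test intercorrelation divided by $[(1 + r_{EE})(1 + r_{MM})]^{1/2}$; the function names and the numerical values are hypothetical.

```python
import math

def ratio_se_general(cov_ee, cov_mm, n, r_ee, r_mm, r_em, r_e2m2, r_em2, r_e2m):
    """Formula (21). r_e2m2, r_em2, r_e2m stand for r_E'M', r_EM', r_E'M."""
    bracket = (1 / r_ee ** 2 + 1 / r_mm ** 2 + 2
               - 2 * (r_em * r_e2m2 + r_em2 * r_e2m) / (r_ee * r_mm))
    return (cov_ee / cov_mm) * math.sqrt(bracket / n)

def ratio_se_equivalent(cov_ee, cov_mm, n, r_ee, r_mm, r_em_full):
    """Formula (22). r_em_full is the correlation between whole tests."""
    bracket = (1 / r_ee ** 2 + 1 / r_mm ** 2 + 2
               - r_em_full ** 2 * (1 + r_ee) * (1 + r_mm) / (r_ee * r_mm))
    return (cov_ee / cov_mm) * math.sqrt(bracket / n)

def full_test_correlation(r_half, r_ee, r_mm):
    """Whole-test r_EM from a common half-test intercorrelation.

    Assumes equal variances for the two halves of each test.
    """
    return 2 * r_half / math.sqrt((1 + r_ee) * (1 + r_mm))

# Hypothetical half-test statistics for a sample of 200 pupils.
r_half, r_ee, r_mm = 0.50, 0.80, 0.85
general = ratio_se_general(90.0, 110.0, 200, r_ee, r_mm, r_half, r_half, r_half, r_half)
reduced = ratio_se_equivalent(90.0, 110.0, 200, r_ee, r_mm,
                              full_test_correlation(r_half, r_ee, r_mm))
print(round(general, 4), round(reduced, 4))
```

Under these assumptions the two printed values coincide, so only the single whole-test intercorrelation need be computed.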

Finally it should be noted that throughout 
this Appendix standard errors (both sampling 
and response) have been designated by the 
symbol o with appropriate subscripts, while 
in Table II, the symbol « has been used. 


APPENDIX III 
CORRELATION OF AQ WITH IQ AND EQ


The necessity, in general, for negative cor- 
relation between IQ and AQ has been dis- 
cussed by various writers. As early as 1923, 
Toops and Symonds⁷ showed this fact by a
geometric discussion of the regression of EQ
on IQ. This discussion and later ones by
various other writers have been persistently
misinterpreted. Even Douglas and Huffaker,
in the paper in which they offer an algebraic
proof of the same point¹, criticize Toops and
Symonds in a manner which suggests that 
they do not follow the geometric argument. 
In the present paper, therefore, we repeat the 
Douglas—Huffaker algebraic proof and ex- 
tend it. 

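Holzinger² has provided the basic approach
in his formula for the correlation between
ratios, which we may write, for the ratios $x/w$ and $z/y$,

$$r_{(x/w)(z/y)} = \frac{r_{xz}V_xV_z - r_{xy}V_xV_y - r_{wz}V_wV_z + r_{wy}V_wV_y}{(V_x^2 + V_w^2 - 2r_{xw}V_xV_w)^{1/2}\,(V_z^2 + V_y^2 - 2r_{zy}V_zV_y)^{1/2}}, \tag{23}$$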


where $V = \sigma/M$. If we let $i$ = IQ and $e$ = EQ, so that $a$ = AQ = $e/i$, we find on substituting in (23) that,

$$r_{ai} = \frac{r_{ei}V_e - V_i}{(V_e^2 + V_i^2 - 2r_{ei}V_eV_i)^{1/2}}. \tag{24}$$

The only point to note in making this substitution is that $y$ is to be replaced by unity. Therefore all correlations with $y$ will equal zero, as will also the value of $V_y$. The value of (24) will obviously be negative unless $r_{ei}V_e$ exceeds $V_i$ in numerical value. This obviously will happen only when $r_{ei}$ is fairly high and when in addition $\sigma_e$ exceeds $\sigma_i$ and/or $M_i$ exceeds $M_e$ by a fairly considerable amount.

By a similar substitution in (23) it is easily shown that,

$$r_{ae} = \frac{V_e - r_{ei}V_i}{(V_e^2 + V_i^2 - 2r_{ei}V_eV_i)^{1/2}}. \tag{25}$$

The value of (25) will be positive unless $r_{ei}$ is fairly high and in addition $\sigma_i$ exceeds $\sigma_e$ and/or $M_e$ exceeds $M_i$ by some substantial margin. If means and standard deviations are equal, the correlation of AQ and IQ will be negative and the correlation of AQ and EQ positive. As $r_{ei}$ approaches unity, the AQ correlations approach each other in value. As $\sigma_e$ increases relative to $\sigma_i$ and/or $M_i$ increases relative to $M_e$, both AQ correlations become higher in the algebraic sense (negative correlations tending to become positive). The Otis Equivalent AQ’s of the present study illustrate this point. As $\sigma_i$ increases relative to $\sigma_e$ and/or $M_e$ increases relative to $M_i$, both AQ correlations become lower algebraically (positive correlations tending to become negative).
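
The substitutions that lead from the general ratio formula to (24) and (25) can be carried out mechanically. The sketch below codes the standard approximation for the correlation between two ratios, here written for hypothetical variables $x/w$ and $z/y$ (these letter names are assumptions, not necessarily the original notation), and then sets $y$ equal to unity as the text directs.

```python
import math

def ratio_correlation(v_x, v_w, v_z, v_y, r_xz, r_xy, r_wz, r_wy, r_xw, r_zy):
    """Approximate correlation between the ratios x/w and z/y.

    Each v_* is a coefficient of variation (sigma / mean).
    """
    numerator = (r_xz * v_x * v_z - r_xy * v_x * v_y
                 - r_wz * v_w * v_z + r_wy * v_w * v_y)
    denominator = (math.sqrt(v_x ** 2 + v_w ** 2 - 2 * r_xw * v_x * v_w)
                   * math.sqrt(v_z ** 2 + v_y ** 2 - 2 * r_zy * v_z * v_y))
    return numerator / denominator

def aq_iq_correlation(v_e, v_i, r_ei):
    """Formula (24): AQ = e/i against IQ = i/1, so V_y = 0 and r with y = 0."""
    return ratio_correlation(v_e, v_i, v_i, 0.0, r_ei, 0.0, 1.0, 0.0, r_ei, 0.0)

def aq_eq_correlation(v_e, v_i, r_ei):
    """Formula (25): AQ = e/i against EQ = e/1."""
    return ratio_correlation(v_e, v_i, v_e, 0.0, 1.0, 0.0, r_ei, 0.0, r_ei, 0.0)

# Hypothetical case of equal means and standard deviations, r between EQ and IQ of .70.
print(round(aq_iq_correlation(0.15, 0.15, 0.70), 4))
print(round(aq_eq_correlation(0.15, 0.15, 0.70), 4))
```

With equal coefficients of variation the two correlations are equal in magnitude and opposite in sign, negative with IQ and positive with EQ, as stated above.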

If we now let $e$ = EA, $m$ = MA, $b = b_{em}$, and $a$ = RAQ, we may define this RAQ by the equation,

$$a = \frac{e}{bm + M_e - bM_m} = \frac{e}{B}. \tag{26}$$


The correlation between mental age and RAQ
may be obtained from (23) as before.

$$r_{am} = \frac{r_{em}V_eV_m - r_{mB}V_mV_B}{V_m\,(V_e^2 + V_B^2 - 2r_{eB}V_eV_B)^{1/2}}. \tag{27}$$





To evaluate this formula we must first evaluate $V_B$, $r_{mB}$, and $r_{eB}$. It is easily ascertained by direct expansion of $B$ as defined in (26) that $M_B = M_e$, and $\sigma_B = r_{em}\sigma_e$, whence $V_B = r_{em}V_e$. By similar but more lengthy expansions, using these values, we find that $r_{mB} = 1$ and $r_{eB} = r_{em}$. Substituting in (27) and reducing, we find that,

$$r_{am} = 0. \tag{28}$$


By a similar line of reasoning we find that,

$$r_{ae} = (1 - r_{em}^2)^{1/2}. \tag{29}$$




Stating these findings in words, the correlation 
between RAQ and MA is zero, and the corre- 
lation between RAQ and EA is equal to the 
alienation coefficient of EA and MA. Note 
that these results are absolutely independent 
of any specific relations among the means and 
standard deviations. 
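
A small simulation offers a rough check on these two results. The sketch below draws normally distributed EA and MA scores with a chosen correlation, forms the RAQ from the sample regression of EA on MA, and compares the observed correlations with zero and with the alienation coefficient. The sample size, means, standard deviations, and seed are arbitrary assumptions, and agreement is only approximate, both because the ratio formula is itself an approximation and because of sampling fluctuation.

```python
import math
import random

def pearson(xs, ys):
    """Product-moment correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate_raq(n=20000, mean=120.0, sd=15.0, r_em=0.7, seed=1937):
    """Return (r_em, r of RAQ with MA, r of RAQ with EA) for simulated pupils."""
    rng = random.Random(seed)
    ma, ea = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        ma.append(mean + sd * z1)
        ea.append(mean + sd * (r_em * z1 + math.sqrt(1 - r_em ** 2) * z2))
    r = pearson(ea, ma)
    mean_e, mean_m = sum(ea) / n, sum(ma) / n
    sd_e = math.sqrt(sum((e - mean_e) ** 2 for e in ea) / n)
    sd_m = math.sqrt(sum((m - mean_m) ** 2 for m in ma) / n)
    b = r * sd_e / sd_m                       # regression of EA on MA
    raq = [e / (b * m + mean_e - b * mean_m)  # formula (26)
           for e, m in zip(ea, ma)]
    return r, pearson(raq, ma), pearson(raq, ea)

r, r_am, r_ae = simulate_raq()
print(round(r_am, 3), round(r_ae, 3), round(math.sqrt(1 - r ** 2), 3))
```

Changing the means or standard deviations in `simulate_raq` should leave both comparisons essentially unaffected, which is the independence noted in the preceding paragraph.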


The RAQ might be defined as EQ divided 
by estimated EQ. This definition is not in 
general equivalent to defining it as EA di- 
vided by estimated EA, in the sense that the 
ordinary AQ may be defined as, 


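$$\mathrm{AQ} = \frac{\mathrm{EQ}}{\mathrm{IQ}} = \frac{\mathrm{EA}/\mathrm{CA}}{\mathrm{MA}/\mathrm{CA}} = \frac{\mathrm{EA}}{\mathrm{MA}}.$$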


With this new definition of the RAQ, let $e$ = EQ, $i$ = IQ, $b = b_{ei}$, and $a$ = RAQ. Then,

$$a = \frac{e}{bi + M_e - bM_i} = \frac{e}{C}. \tag{30}$$


Substituting in (23),

$$r_{ai} = \frac{r_{ei}V_eV_i - r_{iC}V_iV_C}{V_i\,(V_e^2 + V_C^2 - 2r_{eC}V_eV_C)^{1/2}}. \tag{31}$$


To evaluate this formula we require the values of $V_C$, $r_{iC}$, and $r_{eC}$. By direct expansion and substitution we find that $M_C = M_e$, $\sigma_C = r_{ei}\sigma_e$, $V_C = r_{ei}V_e$, $r_{iC} = 1$, and $r_{eC} = r_{ei}$. Substituting in (31) and reducing,

$$r_{ai} = 0, \tag{32}$$

and similarly,

$$r_{ae} = (1 - r_{ei}^2)^{1/2}. \tag{33}$$


The analogy between (28) and (29) on the 
one hand and (32) and (33) on the other 
hardly requires any comment. Since the 
correlations of this RAQ with IQ and EQ do 
not depend on relations among the means and 
standard deviations, it is quite probably more 
useful than the first. To clinch this argument 
it is only necessary to show that if we define 
the RAQ as EA divided by estimated EA, its 
correlations with IQ and EQ are more com- 
plex affairs. 


Let $e$ = EA, $m$ = MA, $c$ = CA, $b = b_{em}$, and $a$ = RAQ = $e/B$. Equation (26) defines $B$. From (23), the correlation of this RAQ with IQ = $m/c$ is,

$$r_{a(m/c)} = \frac{r_{em}V_eV_m - r_{mB}V_mV_B - r_{ec}V_eV_c + r_{cB}V_cV_B}{(V_e^2 + V_B^2 - 2r_{eB}V_eV_B)^{1/2}\,(V_m^2 + V_c^2 - 2r_{cm}V_cV_m)^{1/2}}. \tag{34}$$


To evaluate this expression we require $V_B$, $r_{eB}$, $r_{mB}$, and $r_{cB}$. We have noted already that $V_B = r_{em}V_e$, $r_{mB} = 1$, and $r_{eB} = r_{em}$. Expanding $B$ in the Pearson formula for $r_{cB}$, we find that $r_{cB} = r_{cm}$. Substituting these values in (34) and reducing, we obtain,

$$r_{a(m/c)} = \frac{V_c\,(r_{cm}r_{em} - r_{ec})}{(V_m^2 + V_c^2 - 2r_{cm}V_cV_m)^{1/2}\,(1 - r_{em}^2)^{1/2}}. \tag{35}$$


This correlation will equal zero only if $r_{ec} = r_{cm}r_{em}$, and if this is not the case its value is evidently dependent upon a variety of different relationships.

We may also obtain the correlation between 
RAQ and EQ by a similar line of reasoning. 


$$r_{a(e/c)} = \frac{V_e\,(1 - r_{em}^2) + V_c\,(r_{cm}r_{em} - r_{ec})}{(1 - r_{em}^2)^{1/2}\,(V_e^2 + V_c^2 - 2r_{ec}V_eV_c)^{1/2}}. \tag{36}$$

From these last two equations it is clear that 
for simplicity in meaning and freedom from 
interpretative restrictions, the RAQ defined as 
EQ divided by estimated EQ is superior to the 
one defined as EA divided by estimated EA. 
That its superiority is not, perhaps, as marked 
as a first inspection of the formulas might 
imply, may be inferred by inspection of Ta- 
ble II. 


If we define $e$ as EQ in Formula (10) of Appendix II, and take $m$ as IQ, Formulas (11) and (12) give the standard response error of the RAQ when it is defined as here recommended. In this case all the original correlations are computed with the quotients rather than the age-scores.


APPENDIX IV 
REFERENCES 


1. Douglas, H. R., and Huffaker, C. L., “Cor- 
relation between intelligence quotient and 
accomplishment quotient”. J. Applied 
Psy., 13, 76-80 (1929). 

2. Holzinger, K. J., “Formulas for the corre- 
lation between ratios”. J. Ed. Psy., 14, 
344-347 (1923). 

3. Huffaker, C. L., “The probable error of 
the accomplishment quotient”. J. Ed. Psy., 
21, 550-551 (1930). 

4. Kelley, T. L., The Influence of Nurture 
Upon Native Differences. Macmillan, 
1926, 49 p. 

5. Kelley, T. L., Interpretation of Educa-
tional Measurements. World Book Co.,
1927, 363 p.

6. Peters, C. C., “A method for computing ac- 
complishment quotients on the high school 
and college levels”. J. Ed. Research, 14, 
99-111 (June—Dec., 1926). 

7. Toops, H. A., and Symonds, P. M., “What 
shall we expect of the AQ?” J. Ed. Psy., 
13, 513-528 (1922). 

8. Wishart, J., “The generalized product mo- 
ment distribution in samples from a nor- 
mal multivariate population”. Biometrika, 
20, 33-52 (1928). 

